Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Andrew Morton @ 2018-04-19 19:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191207380.31175@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.

Yes, that's nasty.

> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> ...
>
> --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	 */
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
> +#endif
>  
>  	return __vmalloc_node_flags_caller(size, node, flags,
>  			__builtin_return_address(0));

Well, it doesn't have to be done at compile-time, does it?  We could
add a knob (in debugfs, presumably) which enables this at runtime. 
That's far more user-friendly.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-19 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <20180419124751.8884e516e99825d83da3d87a@linux-foundation.org>



On Thu, 19 Apr 2018, Andrew Morton wrote:

> On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> > 
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> > 
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
> 
> Yes, that's nasty.
> 
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > ...
> >
> > --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> > +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> >   */
> >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> >  {
> > +#ifndef CONFIG_DEBUG_VM
> >  	gfp_t kmalloc_flags = flags;
> >  	void *ret;
> >  
> > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	 */
> >  	if (ret || size <= PAGE_SIZE)
> >  		return ret;
> > +#endif
> >  
> >  	return __vmalloc_node_flags_caller(size, node, flags,
> >  			__builtin_return_address(0));
> 
> Well, it doesn't have to be done at compile-time, does it?  We could
> add a knob (in debugfs, presumably) which enables this at runtime. 
> That's far more user-friendly.

But who will turn it on in debugfs? It should be default for debugging 
kernels, so that users using them would report the error.

Conditioning it on CONFIG_DEBUG_SG is better than CONFIG_DEBUG_VM, it will 
print a stacktrace where the incorrect use happened.

Mikulas

^ permalink raw reply

* [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-19 21:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: dm-devel, eric.dumazet, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, Andrew Morton, David Miller, Vlastimil Babka
In-Reply-To: <20180419193554-mutt-send-email-mst@kernel.org>



On Thu, 19 Apr 2018, Michael S. Tsirkin wrote:

> Maybe make it conditional on CONFIG_DEBUG_SG too?
> Otherwise I think you just trigger a hard to debug memory corruption.

OK, here I resend the patch with CONFIG_DEBUG_SG. With CONFIG_DEBUG_SG, 
the DMA API will print a stacktrace where the misuse happened, so it's 
much easier to debug than with CONFIG_DEBUG_VM.

Fedora doesn't use CONFIG_DEBUG_SG in its default kernel (it only uses it 
in the debugging kernel), so users won't be hurt by this.



From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because vmalloc falls back to kmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_SG is turned on.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/util.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-19 23:14:14.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
  */
 void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
+#ifndef CONFIG_DEBUG_SG
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+#endif
 
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Andrew Morton @ 2018-04-19 23:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191716100.10099@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:

> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > 
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> > > +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > >   */
> > >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > >  {
> > > +#ifndef CONFIG_DEBUG_VM
> > >  	gfp_t kmalloc_flags = flags;
> > >  	void *ret;
> > >  
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >  	 */
> > >  	if (ret || size <= PAGE_SIZE)
> > >  		return ret;
> > > +#endif
> > >  
> > >  	return __vmalloc_node_flags_caller(size, node, flags,
> > >  			__builtin_return_address(0));
> > 
> > Well, it doesn't have to be done at compile-time, does it?  We could
> > add a knob (in debugfs, presumably) which enables this at runtime. 
> > That's far more user-friendly.
> 
> But who will turn it on in debugfs?

But who will turn it on in Kconfig?  Just a handful of developers.  We
could add SONFIG_DEBUG_SG to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.

Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
googling indicates that they aren't the only ones...

> It should be default for debugging 
> kernels, so that users using them would report the error.

Well.  This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.

So how could we define a debug kernel in which it's OK to enable such
things?

- Could be "it's an -rc kernel".  But then we'd be enabling a bunch of
  untested code when Linus cuts a release.

- Could be "it's an -rc kernel with SUBLEVEL <= 5".  But then we risk
  unexpected things happening when Linux cuts -rc6, which still isn't
  good.

- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
  SUBLEVEL <= 5".  That way everybody who runs -rc1, -rc3 and -rc5 will
  have kvmalloc debugging enabled.  That's potentially nasty because
  vmalloc is much slower than kmalloc.  But kvmalloc() is only used for
  large and probably infrequent allocations, so it's probably OK.

I wonder how we get at SUBLEVEL from within .c.  

^ permalink raw reply

* [PATCH net-next v7 0/4] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-04-20  1:42 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
used by hypervisor to indicate that virtio_net interface should act as
a standby for another device with the same MAC address.

Patch 2 introduces a failover module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver creates
an additional 'failover' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'standby' netdev and a passthru/vf device with the same MAC gets
registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
netdev as default for transmits when it is available with link up and
running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by failover module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'primary' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
  (jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
  so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
  registered as primary. (stephen)
-  minor bugfixes

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
  net: Introduce generic failover module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the failover
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 +++-------
 drivers/net/virtio_net.c        |  38 +-
 include/linux/netdevice.h       |  16 +
 include/net/failover.h          |  96 +++++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  18 +
 net/core/Makefile               |   1 +
 net/core/failover.c             | 844 ++++++++++++++++++++++++++++++++++++++++
 11 files changed, 1070 insertions(+), 158 deletions(-)
 create mode 100644 include/net/failover.h
 create mode 100644 net/core/failover.c

-- 
2.14.3

^ permalink raw reply

* [PATCH v7 net-next 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-04-20  1:42 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..6f95719ede40 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-04-20  1:42 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. existing netvsc driver that uses 2 netdev model. In this model, no
master netdev is created. The paravirtual driver registers each instance
of netvsc as a 'failover' instance  along with a set of ops to manage the
slave events.
     failover_register()
     failover_unregister()
2. new virtio_net based solution that uses 3 netdev model. In this model,
the failover module provides interfaces to create/destroy additional master
netdev and all the slave events are managed internally.
      failover_create()
      failover_destroy()
These functions call failover_register()/failover_unregister() with the
master netdev created by the failover module.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/netdevice.h |  16 +
 include/net/failover.h    |  96 ++++++
 net/Kconfig               |  18 +
 net/core/Makefile         |   1 +
 net/core/failover.c       | 844 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 975 insertions(+)
 create mode 100644 include/net/failover.h
 create mode 100644 net/core/failover.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..ed535b6724e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,6 +1401,8 @@ struct net_device_ops {
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_FAILOVER			= 1<<27,
+	IFF_FAILOVER_SLAVE		= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_FAILOVER			IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/failover.h b/include/net/failover.h
new file mode 100644
index 000000000000..0b8601043d90
--- /dev/null
+++ b/include/net/failover.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct failover_ops {
+	int (*slave_pre_register)(struct net_device *slave_dev,
+				  struct net_device *failover_dev);
+	int (*slave_join)(struct net_device *slave_dev,
+			  struct net_device *failover_dev);
+	int (*slave_pre_unregister)(struct net_device *slave_dev,
+				    struct net_device *failover_dev);
+	int (*slave_release)(struct net_device *slave_dev,
+			     struct net_device *failover_dev);
+	int (*slave_link_change)(struct net_device *slave_dev,
+				 struct net_device *failover_dev);
+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct failover {
+	struct list_head list;
+	struct net_device __rcu *failover_dev;
+	struct failover_ops __rcu *ops;
+};
+
+/* failover state */
+struct failover_info {
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_FAILOVER)
+
+int failover_create(struct net_device *standby_dev,
+		    struct failover **pfailover);
+void failover_destroy(struct failover *failover);
+
+int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
+		      struct failover **pfailover);
+void failover_unregister(struct failover *failover);
+
+int failover_slave_unregister(struct net_device *slave_dev);
+
+#else
+
+static inline
+int failover_create(struct net_device *standby_dev,
+		    struct failover **pfailover);
+{
+	return 0;
+}
+
+static inline
+void failover_destroy(struct failover *failover)
+{
+}
+
+static inline
+int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
+		      struct pfailover **pfailover);
+{
+	return 0;
+}
+
+static inline
+void failover_unregister(struct failover *failover)
+{
+}
+
+static inline
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+	return 0;
+}
+
+#endif
+
+#endif /* _NET_FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..388b99dfee10 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
 	  devlink is a loadable module and the driver using it is built-in.
 
+config NET_FAILOVER
+	tristate "Failover interface"
+	help
+	  This provides a generic interface for paravirtual drivers to listen
+	  for netdev register/unregister/link change events from pci ethernet
+	  devices with the same MAC and takeover their datapath. This also
+	  enables live migration of a VM with direct attached VF by failing
+	  over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_FAILOVER
+	tristate
+	default m if NET_FAILOVER=m
+	default y if NET_FAILOVER=y || NET_FAILOVER=n
+	help
+	  Drivers using the failover infrastructure should have a dependency
+	  on MAY_USE_FAILOVER to ensure they do not cause link errors when
+	  failover is a loadable module and the driver using it is built-in.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..cef17518bb7d 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_FAILOVER) += failover.o
diff --git a/net/core/failover.c b/net/core/failover.c
new file mode 100644
index 000000000000..7bee762cb737
--- /dev/null
+++ b/net/core/failover.c
@@ -0,0 +1,844 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/failover.h>
+
+static LIST_HEAD(failover_list);
+static DEFINE_SPINLOCK(failover_lock);
+
+static int failover_slave_pre_register(struct net_device *slave_dev,
+				       struct net_device *failover_dev,
+				       struct failover_ops *failover_ops)
+{
+	struct failover_info *finfo;
+	bool standby;
+
+	if (failover_ops) {
+		if (!failover_ops->slave_pre_register)
+			return -EINVAL;
+
+		return failover_ops->slave_pre_register(slave_dev,
+							failover_dev);
+	}
+
+	finfo = netdev_priv(failover_dev);
+	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+	if (standby ? rtnl_dereference(finfo->standby_dev) :
+			rtnl_dereference(finfo->primary_dev)) {
+		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_dev->name, standby ? "standby" : "primary");
+		return -EEXIST;
+	}
+
+	/* Avoid non pci devices as primary netdev */
+	if (!standby && (!slave_dev->dev.parent ||
+			 !dev_is_pci(slave_dev->dev.parent)))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int failover_slave_join(struct net_device *slave_dev,
+			       struct net_device *failover_dev,
+			       struct failover_ops *failover_ops)
+{
+	struct failover_info *finfo;
+	int err, orig_mtu;
+	bool standby;
+
+	if (failover_ops) {
+		if (!failover_ops->slave_join)
+			return -EINVAL;
+
+		return failover_ops->slave_join(slave_dev, failover_dev);
+	}
+
+	if (netif_running(failover_dev)) {
+		err = dev_open(slave_dev);
+		if (err && (err != -EBUSY)) {
+			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+				   slave_dev->name, err);
+			goto err_dev_open;
+		}
+	}
+
+	/* Align MTU of slave with failover dev */
+	orig_mtu = slave_dev->mtu;
+	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	if (err) {
+		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+			   slave_dev->name, failover_dev->mtu);
+		goto err_set_mtu;
+	}
+
+	finfo = netdev_priv(failover_dev);
+	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+
+	dev_hold(slave_dev);
+
+	if (standby) {
+		rcu_assign_pointer(finfo->standby_dev, slave_dev);
+		dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
+	} else {
+		rcu_assign_pointer(finfo->primary_dev, slave_dev);
+		dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
+		failover_dev->min_mtu = slave_dev->min_mtu;
+		failover_dev->max_mtu = slave_dev->max_mtu;
+	}
+
+	netdev_info(failover_dev, "failover slave:%s joined\n",
+		    slave_dev->name);
+
+	return 0;
+
+err_set_mtu:
+	dev_close(slave_dev);
+err_dev_open:
+	return err;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
+{
+	struct net_device *failover_dev;
+	struct failover *failover;
+
+	spin_lock(&failover_lock);
+	list_for_each_entry(failover, &failover_list, list) {
+		failover_dev = rtnl_dereference(failover->failover_dev);
+		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+			*ops = rtnl_dereference(failover->ops);
+			spin_unlock(&failover_lock);
+			return failover_dev;
+		}
+	}
+	spin_unlock(&failover_lock);
+	return NULL;
+}
+
+static int failover_slave_register(struct net_device *slave_dev)
+{
+	struct failover_ops *failover_ops;
+	struct net_device *failover_dev;
+	int ret;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+	if (!failover_dev)
+		goto done;
+
+	ret = failover_slave_pre_register(slave_dev, failover_dev,
+					  failover_ops);
+	if (ret)
+		goto done;
+
+	ret = netdev_rx_handler_register(slave_dev, failover_ops ?
+					 failover_ops->handle_frame :
+					 failover_handle_frame, failover_dev);
+	if (ret) {
+		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+			   ret);
+		goto done;
+	}
+
+	ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
+	if (ret) {
+		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+			   failover_dev->name, ret);
+		goto upper_link_failed;
+	}
+
+	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+	ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
+	if (ret)
+		goto err_join;
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+	netdev_info(failover_dev, "failover slave:%s registered\n",
+		    slave_dev->name);
+
+	goto done;
+
+err_join:
+	netdev_upper_dev_unlink(slave_dev, failover_dev);
+	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(slave_dev);
+done:
+	return NOTIFY_DONE;
+}
+
+static int failover_slave_pre_unregister(struct net_device *slave_dev,
+					 struct net_device *failover_dev,
+					 struct failover_ops *failover_ops)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct failover_info *finfo;
+
+	if (failover_ops) {
+		if (!failover_ops->slave_pre_unregister)
+			return -EINVAL;
+
+		return failover_ops->slave_pre_unregister(slave_dev,
+							  failover_dev);
+	}
+
+	finfo = netdev_priv(failover_dev);
+	primary_dev = rtnl_dereference(finfo->primary_dev);
+	standby_dev = rtnl_dereference(finfo->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int failover_slave_release(struct net_device *slave_dev,
+				  struct net_device *failover_dev,
+				  struct failover_ops *failover_ops)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct failover_info *finfo;
+
+	if (failover_ops) {
+		if (!failover_ops->slave_release)
+			return -EINVAL;
+
+		return failover_ops->slave_release(slave_dev, failover_dev);
+	}
+
+	finfo = netdev_priv(failover_dev);
+	primary_dev = rtnl_dereference(finfo->primary_dev);
+	standby_dev = rtnl_dereference(finfo->standby_dev);
+
+	if (slave_dev == standby_dev) {
+		RCU_INIT_POINTER(finfo->standby_dev, NULL);
+	} else {
+		RCU_INIT_POINTER(finfo->primary_dev, NULL);
+		if (standby_dev) {
+			failover_dev->min_mtu = standby_dev->min_mtu;
+			failover_dev->max_mtu = standby_dev->max_mtu;
+		}
+	}
+
+	dev_put(slave_dev);
+
+	netdev_info(failover_dev, "failover slave:%s released\n",
+		    slave_dev->name);
+
+	return 0;
+}
+
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+	struct failover_ops *failover_ops;
+	struct net_device *failover_dev;
+	int ret;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+	if (!failover_dev)
+		goto done;
+
+	ret = failover_slave_pre_unregister(slave_dev, failover_dev,
+					    failover_ops);
+	if (ret)
+		goto done;
+
+	netdev_rx_handler_unregister(slave_dev);
+	netdev_upper_dev_unlink(slave_dev, failover_dev);
+	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+	failover_slave_release(slave_dev, failover_dev, failover_ops);
+
+	netdev_info(failover_dev, "failover slave:%s unregistered\n",
+		    slave_dev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(failover_slave_unregister);
+
+static bool failover_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int failover_slave_link_change(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
+	struct failover_ops *failover_ops;
+	struct failover_info *finfo;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (failover_ops) {
+		if (!failover_ops->slave_link_change)
+			goto done;
+
+		return failover_ops->slave_link_change(slave_dev, failover_dev);
+	}
+
+	if (!netif_running(failover_dev))
+		return 0;
+
+	finfo = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(finfo->primary_dev);
+	standby_dev = rtnl_dereference(finfo->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(failover_dev);
+		netif_tx_wake_all_queues(failover_dev);
+	} else {
+		netif_carrier_off(failover_dev);
+		netif_tx_stop_all_queues(failover_dev);
+	}
+
+done:
+	return NOTIFY_DONE;
+}
+
+static bool failover_validate_event_dev(struct net_device *dev)
+{
+	/* Skip parent events */
+	if (netif_is_failover(dev))
+		return false;
+
+	/* Avoid non-Ethernet type devices */
+	if (dev->type != ARPHRD_ETHER)
+		return false;
+
+	return true;
+}
+
+static int
+failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	if (!failover_validate_event_dev(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return failover_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return failover_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return failover_slave_link_change(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block failover_notifier = {
+	.notifier_call = failover_event,
+};
+
+static int failover_open(struct net_device *dev)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int err;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	primary_dev = rtnl_dereference(finfo->primary_dev);
+	if (primary_dev) {
+		err = dev_open(primary_dev);
+		if (err)
+			goto err_primary_open;
+	}
+
+	standby_dev = rtnl_dereference(finfo->standby_dev);
+	if (standby_dev) {
+		err = dev_open(standby_dev);
+		if (err)
+			goto err_standby_open;
+	}
+
+	return 0;
+
+err_standby_open:
+	dev_close(primary_dev);
+err_primary_open:
+	netif_tx_disable(dev);
+	return err;
+}
+
+static int failover_close(struct net_device *dev)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	netif_tx_disable(dev);
+
+	slave_dev = rtnl_dereference(finfo->primary_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	slave_dev = rtnl_dereference(finfo->standby_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	return 0;
+}
+
+static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
+				      struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
+				       struct net_device *dev)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via primary netdev followed by standby netdev */
+	xmit_dev = rcu_dereference_bh(finfo->primary_dev);
+	if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(finfo->standby_dev);
+		if (!xmit_dev || !failover_xmit_ready(xmit_dev))
+			return failover_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
+				 void *accel_priv,
+				 select_queue_fallback_t fallback)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *primary_dev;
+	u16 txq;
+
+	rcu_read_lock();
+	primary_dev = rcu_dereference(finfo->primary_dev);
+	if (primary_dev) {
+		const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+		if (ops->ndo_select_queue)
+			txq = ops->ndo_select_queue(primary_dev, skb,
+						    accel_priv, fallback);
+		else
+			txq = fallback(primary_dev, skb);
+
+		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+		return txq;
+	}
+
+	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32finfot values only.
+ */
+static void failover_fold_stats(struct rtnl_link_stats64 *_res,
+				const struct rtnl_link_stats64 *_new,
+				const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void failover_get_stats(struct net_device *dev,
+			       struct rtnl_link_stats64 *stats)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_dev;
+
+	spin_lock(&finfo->stats_lock);
+	memcpy(stats, &finfo->failover_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(finfo->primary_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		failover_fold_stats(stats, new, &finfo->primary_stats);
+		memcpy(&finfo->primary_stats, new, sizeof(*new));
+	}
+
+	slave_dev = rcu_dereference(finfo->standby_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		failover_fold_stats(stats, new, &finfo->standby_stats);
+		memcpy(&finfo->standby_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&finfo->failover_stats, stats, sizeof(*stats));
+	spin_unlock(&finfo->stats_lock);
+}
+
+static int failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(finfo->primary_dev);
+	if (primary_dev) {
+		ret = dev_set_mtu(primary_dev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(finfo->standby_dev);
+	if (standby_dev) {
+		ret = dev_set_mtu(standby_dev, new_mtu);
+		if (ret) {
+			dev_set_mtu(primary_dev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static void failover_set_rx_mode(struct net_device *dev)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(finfo->primary_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	slave_dev = rcu_dereference(finfo->standby_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static const struct net_device_ops failover_dev_ops = {
+	.ndo_open		= failover_open,
+	.ndo_stop		= failover_close,
+	.ndo_start_xmit		= failover_start_xmit,
+	.ndo_select_queue	= failover_select_queue,
+	.ndo_get_stats64	= failover_get_stats,
+	.ndo_change_mtu		= failover_change_mtu,
+	.ndo_set_rx_mode	= failover_set_rx_mode,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define FAILOVER_NAME "failover"
+#define FAILOVER_VERSION "0.1"
+
+static void failover_ethtool_get_drvinfo(struct net_device *dev,
+					 struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+int failover_ethtool_get_link_ksettings(struct net_device *dev,
+					struct ethtool_link_ksettings *cmd)
+{
+	struct failover_info *finfo = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rtnl_dereference(finfo->primary_dev);
+	if (!slave_dev || !failover_xmit_ready(slave_dev)) {
+		slave_dev = rtnl_dereference(finfo->standby_dev);
+		if (!slave_dev || !failover_xmit_ready(slave_dev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
+
+static const struct ethtool_ops failover_ethtool_ops = {
+	.get_drvinfo            = failover_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = failover_ethtool_get_link_ksettings,
+};
+
+static void failover_register_existing_slave(struct net_device *failover_dev)
+{
+	struct net *net = dev_net(failover_dev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (dev == failover_dev)
+			continue;
+		if (!failover_validate_event_dev(dev))
+			continue;
+		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+			failover_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+int failover_register(struct net_device *dev, struct failover_ops *ops,
+		      struct failover **pfailover)
+{
+	struct failover *failover;
+
+	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+	if (!failover)
+		return -ENOMEM;
+
+	rcu_assign_pointer(failover->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_FAILOVER;
+	rcu_assign_pointer(failover->failover_dev, dev);
+
+	spin_lock(&failover_lock);
+	list_add_tail(&failover->list, &failover_list);
+	spin_unlock(&failover_lock);
+
+	failover_register_existing_slave(dev);
+
+	*pfailover = failover;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(failover_register);
+
+void failover_unregister(struct failover *failover)
+{
+	struct net_device *failover_dev;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+
+	failover_dev->priv_flags &= ~IFF_FAILOVER;
+	dev_put(failover_dev);
+
+	spin_lock(&failover_lock);
+	list_del(&failover->list);
+	spin_unlock(&failover_lock);
+
+	kfree(failover);
+}
+EXPORT_SYMBOL_GPL(failover_unregister);
+
+int failover_create(struct net_device *standby_dev, struct failover **pfailover)
+{
+	struct device *dev = standby_dev->dev.parent;
+	struct net_device *failover_dev;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
+	if (!failover_dev) {
+		dev_err(dev, "Unable to allocate failover_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(failover_dev, dev_net(standby_dev));
+	SET_NETDEV_DEV(failover_dev, dev);
+
+	failover_dev->netdev_ops = &failover_dev_ops;
+	failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+	/* Initialize the device options */
+	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire failover netdev's netif_tx_lock when transmitting */
+	failover_dev->features |= NETIF_F_LLTX;
+
+	/* Don't allow failover devices to change network namespaces. */
+	failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+	failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				    NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	failover_dev->features |= failover_dev->hw_features;
+
+	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+	       failover_dev->addr_len);
+
+	failover_dev->min_mtu = standby_dev->min_mtu;
+	failover_dev->max_mtu = standby_dev->max_mtu;
+
+	err = register_netdev(failover_dev);
+	if (err < 0) {
+		dev_err(dev, "Unable to register failover_dev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(failover_dev);
+
+	err = failover_register(failover_dev, NULL, pfailover);
+	if (err < 0)
+		goto err_failover;
+
+	return 0;
+
+err_failover:
+	unregister_netdev(failover_dev);
+err_register_netdev:
+	free_netdev(failover_dev);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(failover_create);
+
+void failover_destroy(struct failover *failover)
+{
+	struct net_device *failover_dev;
+	struct net_device *slave_dev;
+	struct failover_info *finfo;
+
+	if (!failover)
+		return;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+	finfo = netdev_priv(failover_dev);
+
+	netif_device_detach(failover_dev);
+
+	rtnl_lock();
+
+	slave_dev = rtnl_dereference(finfo->primary_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	slave_dev = rtnl_dereference(finfo->standby_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	failover_unregister(failover);
+
+	unregister_netdevice(failover_dev);
+
+	rtnl_unlock();
+
+	free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(failover_destroy);
+
+static __init int
+failover_init(void)
+{
+	register_netdevice_notifier(&failover_notifier);
+
+	return 0;
+}
+module_init(failover_init);
+
+static __exit
+void failover_exit(void)
+{
+	unregister_netdevice_notifier(&failover_notifier);
+}
+module_exit(failover_exit);
+
+MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related

* [PATCH v7 net-next 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-04-20  1:42 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It uses the generic failover framework that provides 2 functions to create
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/Kconfig      |  1 +
 drivers/net/virtio_net.c | 36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..5abe328973da 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	depends on MAY_USE_FAILOVER
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6f95719ede40..42b9f9bff48b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/failover.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* failover when STANDBY feature enabled */
+	struct failover *failover;
 };
 
 struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_sby");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2839,10 +2862,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+		err = failover_create(vi->dev, &vi->failover);
+		if (err)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_failover;
 	}
 
 	virtio_device_ready(vdev);
@@ -2879,6 +2908,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_failover:
+	failover_destroy(vi->failover);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2913,6 +2944,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	failover_destroy(vi->failover);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
-- 
2.14.3

^ permalink raw reply related

* [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-04-20  1:42 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>

Use the registration/notification framework supported by the generic
failover infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
 3 files changed, 55 insertions(+), 156 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..56099d10beed 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
+	depends on MAY_USE_FAILOVER
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..d8c2ff698693 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..8404c22de32b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/failover.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
 			  struct net_device *ndev)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(ndev);
-	int ret;
-
-	ret = netdev_rx_handler_register(vf_netdev,
-					 netvsc_vf_handle_frame, ndev);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not register netvsc VF receive handler (err = %d)\n",
-			   ret);
-		goto rx_handler_failed;
-	}
-
-	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not set master device %s (err = %d)\n",
-			   ndev->name, ret);
-		goto upper_link_failed;
-	}
-
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
 
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
 	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-	return 0;
 
-upper_link_failed:
-	netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-	return ret;
+	dev_hold(vf_netdev);
+	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+	return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+				  struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
-	if (vf_netdev->addr_len != ETH_ALEN)
-		return NOTIFY_DONE;
-
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return NOTIFY_DONE;
-
-	if (netvsc_vf_join(vf_netdev, ndev) != 0)
-		return NOTIFY_DONE;
+		return -EEXIST;
 
 	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
-	dev_hold(vf_netdev);
-	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
-	return NOTIFY_OK;
+	return 0;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return NOTIFY_DONE;
+		return -EINVAL;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_vf_release(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
 
-	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
-	netdev_rx_handler_unregister(vf_netdev);
-	netdev_upper_dev_unlink(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
 	dev_put(vf_netdev);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
+static int netvsc_vf_pre_unregister(struct net_device *vf_netdev,
+				    struct net_device *ndev)
+{
+	struct net_device_context *net_device_ctx;
+
+	net_device_ctx = netdev_priv(ndev);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
+
+	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+
+	return 0;
+}
+
+static struct failover_ops netvsc_failover_ops = {
+	.slave_pre_register	= netvsc_vf_pre_register,
+	.slave_join		= netvsc_vf_join,
+	.slave_pre_unregister	= netvsc_vf_pre_unregister,
+	.slave_release		= netvsc_vf_release,
+	.slave_link_change	= netvsc_vf_changed,
+	.handle_frame		= netvsc_vf_handle_frame,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2016,15 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	ret = failover_register(net, &netvsc_failover_ops,
+				&net_device_ctx->failover);
+	if (ret != 0)
+		goto err_failover;
+
 	return ret;
 
+err_failover:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2065,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		failover_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	failover_unregister(ndev_ctx->failover);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2100,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2121,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20  2:44 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <1524188524-28411-3-git-send-email-sridhar.samudrala@intel.com>

On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
> This provides a generic interface for paravirtual drivers to listen
> for netdev register/unregister/link change events from pci ethernet
> devices with the same MAC and takeover their datapath. The notifier and
> event handling code is based on the existing netvsc implementation.
> 
> It exposes 2 sets of interfaces to the paravirtual drivers.
> 1. existing netvsc driver that uses 2 netdev model. In this model, no
> master netdev is created. The paravirtual driver registers each instance
> of netvsc as a 'failover' instance  along with a set of ops to manage the
> slave events.
>      failover_register()
>      failover_unregister()
> 2. new virtio_net based solution that uses 3 netdev model. In this model,
> the failover module provides interfaces to create/destroy additional master
> netdev and all the slave events are managed internally.
>       failover_create()
>       failover_destroy()
> These functions call failover_register()/failover_unregister() with the
> master netdev created by the failover module.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

I like this patch. Yes something to improve (see below)

> ---
>  include/linux/netdevice.h |  16 +
>  include/net/failover.h    |  96 ++++++
>  net/Kconfig               |  18 +
>  net/core/Makefile         |   1 +
>  net/core/failover.c       | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 975 insertions(+)
>  create mode 100644 include/net/failover.h
>  create mode 100644 net/core/failover.c
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cf44503ea81a..ed535b6724e1 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1401,6 +1401,8 @@ struct net_device_ops {
>   *	entity (i.e. the master device for bridged veth)
>   * @IFF_MACSEC: device is a MACsec device
>   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> + * @IFF_FAILOVER: device is a failover master device
> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>   */
>  enum netdev_priv_flags {
>  	IFF_802_1Q_VLAN			= 1<<0,
> @@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
>  	IFF_PHONY_HEADROOM		= 1<<24,
>  	IFF_MACSEC			= 1<<25,
>  	IFF_NO_RX_HANDLER		= 1<<26,
> +	IFF_FAILOVER			= 1<<27,
> +	IFF_FAILOVER_SLAVE		= 1<<28,
>  };
>  
>  #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> @@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
>  #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>  #define IFF_MACSEC			IFF_MACSEC
>  #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
> +#define IFF_FAILOVER			IFF_FAILOVER
> +#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
>  
>  /**
>   *	struct net_device - The DEVICE structure.
> @@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>  	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>  }
>  
> +static inline bool netif_is_failover(const struct net_device *dev)
> +{
> +	return dev->priv_flags & IFF_FAILOVER;
> +}
> +
> +static inline bool netif_is_failover_slave(const struct net_device *dev)
> +{
> +	return dev->priv_flags & IFF_FAILOVER_SLAVE;
> +}
> +
>  /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>  static inline void netif_keep_dst(struct net_device *dev)
>  {
> diff --git a/include/net/failover.h b/include/net/failover.h
> new file mode 100644
> index 000000000000..0b8601043d90
> --- /dev/null
> +++ b/include/net/failover.h
> @@ -0,0 +1,96 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2018, Intel Corporation. */
> +
> +#ifndef _NET_FAILOVER_H
> +#define _NET_FAILOVER_H
> +
> +#include <linux/netdevice.h>
> +
> +struct failover_ops {
> +	int (*slave_pre_register)(struct net_device *slave_dev,
> +				  struct net_device *failover_dev);
> +	int (*slave_join)(struct net_device *slave_dev,
> +			  struct net_device *failover_dev);
> +	int (*slave_pre_unregister)(struct net_device *slave_dev,
> +				    struct net_device *failover_dev);
> +	int (*slave_release)(struct net_device *slave_dev,
> +			     struct net_device *failover_dev);
> +	int (*slave_link_change)(struct net_device *slave_dev,
> +				 struct net_device *failover_dev);
> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> +};
> +
> +struct failover {
> +	struct list_head list;
> +	struct net_device __rcu *failover_dev;
> +	struct failover_ops __rcu *ops;
> +};
> +
> +/* failover state */
> +struct failover_info {
> +	/* primary netdev with same MAC */
> +	struct net_device __rcu *primary_dev;
> +
> +	/* standby netdev */
> +	struct net_device __rcu *standby_dev;
> +
> +	/* primary netdev stats */
> +	struct rtnl_link_stats64 primary_stats;
> +
> +	/* standby netdev stats */
> +	struct rtnl_link_stats64 standby_stats;
> +
> +	/* aggregated stats */
> +	struct rtnl_link_stats64 failover_stats;
> +
> +	/* spinlock while updating stats */
> +	spinlock_t stats_lock;
> +};
> +
> +#if IS_ENABLED(CONFIG_NET_FAILOVER)
> +
> +int failover_create(struct net_device *standby_dev,
> +		    struct failover **pfailover);
> +void failover_destroy(struct failover *failover);
> +
> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
> +		      struct failover **pfailover);
> +void failover_unregister(struct failover *failover);
> +
> +int failover_slave_unregister(struct net_device *slave_dev);
> +
> +#else
> +
> +static inline
> +int failover_create(struct net_device *standby_dev,
> +		    struct failover **pfailover);
> +{
> +	return 0;
> +}
> +
> +static inline
> +void failover_destroy(struct failover *failover)
> +{
> +}
> +
> +static inline
> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
> +		      struct pfailover **pfailover);
> +{
> +	return 0;
> +}
> +
> +static inline
> +void failover_unregister(struct failover *failover)
> +{
> +}
> +
> +static inline
> +int failover_slave_unregister(struct net_device *slave_dev)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
> +#endif /* _NET_FAILOVER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index 0428f12c25c2..388b99dfee10 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>  	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>  	  devlink is a loadable module and the driver using it is built-in.
>  
> +config NET_FAILOVER
> +	tristate "Failover interface"
> +	help
> +	  This provides a generic interface for paravirtual drivers to listen
> +	  for netdev register/unregister/link change events from pci ethernet
> +	  devices with the same MAC and takeover their datapath. This also
> +	  enables live migration of a VM with direct attached VF by failing
> +	  over to the paravirtual datapath when the VF is unplugged.
> +
> +config MAY_USE_FAILOVER
> +	tristate
> +	default m if NET_FAILOVER=m
> +	default y if NET_FAILOVER=y || NET_FAILOVER=n
> +	help
> +	  Drivers using the failover infrastructure should have a dependency
> +	  on MAY_USE_FAILOVER to ensure they do not cause link errors when
> +	  failover is a loadable module and the driver using it is built-in.
> +
>  endif   # if NET
>  
>  # Used by archs to tell that they support BPF JIT compiler plus which flavour.
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 6dbbba8c57ae..cef17518bb7d 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>  obj-$(CONFIG_HWBM) += hwbm.o
>  obj-$(CONFIG_NET_DEVLINK) += devlink.o
>  obj-$(CONFIG_GRO_CELLS) += gro_cells.o
> +obj-$(CONFIG_NET_FAILOVER) += failover.o
> diff --git a/net/core/failover.c b/net/core/failover.c
> new file mode 100644
> index 000000000000..7bee762cb737
> --- /dev/null
> +++ b/net/core/failover.c
> @@ -0,0 +1,844 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018, Intel Corporation. */


I think you should copy the header from bond_main.c upon which
some of the code seems to be based.

> +
> +/* A common module to handle registrations and notifications for paravirtual
> + * drivers to enable accelerated datapath and support VF live migration.
> + *
> + * The notifier and event handling code is based on netvsc driver.
> + */
> +
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/netdevice.h>
> +#include <linux/netpoll.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_vlan.h>
> +#include <linux/pci.h>
> +#include <net/sch_generic.h>
> +#include <uapi/linux/if_arp.h>
> +#include <net/failover.h>
> +
> +static LIST_HEAD(failover_list);
> +static DEFINE_SPINLOCK(failover_lock);
> +
> +static int failover_slave_pre_register(struct net_device *slave_dev,
> +				       struct net_device *failover_dev,
> +				       struct failover_ops *failover_ops)
> +{
> +	struct failover_info *finfo;
> +	bool standby;
> +
> +	if (failover_ops) {
> +		if (!failover_ops->slave_pre_register)
> +			return -EINVAL;
> +
> +		return failover_ops->slave_pre_register(slave_dev,
> +							failover_dev);
> +	}
> +
> +	finfo = netdev_priv(failover_dev);
> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
> +	if (standby ? rtnl_dereference(finfo->standby_dev) :
> +			rtnl_dereference(finfo->primary_dev)) {
> +		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
> +			   slave_dev->name, standby ? "standby" : "primary");
> +		return -EEXIST;
> +	}
> +
> +	/* Avoid non pci devices as primary netdev */

Why? Pls change this comment so it explains the motivation
rather than just repeat what the code does.

> +	if (!standby && (!slave_dev->dev.parent ||
> +			 !dev_is_pci(slave_dev->dev.parent)))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int failover_slave_join(struct net_device *slave_dev,
> +			       struct net_device *failover_dev,
> +			       struct failover_ops *failover_ops)
> +{
> +	struct failover_info *finfo;
> +	int err, orig_mtu;
> +	bool standby;
> +
> +	if (failover_ops) {
> +		if (!failover_ops->slave_join)
> +			return -EINVAL;
> +
> +		return failover_ops->slave_join(slave_dev, failover_dev);
> +	}
> +
> +	if (netif_running(failover_dev)) {
> +		err = dev_open(slave_dev);
> +		if (err && (err != -EBUSY)) {
> +			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
> +				   slave_dev->name, err);
> +			goto err_dev_open;
> +		}
> +	}
> +
> +	/* Align MTU of slave with failover dev */
> +	orig_mtu = slave_dev->mtu;

I suspect this was copied from bond. this variable is never
used and I'm even surprised gcc did not warn about this.


> +	err = dev_set_mtu(slave_dev, failover_dev->mtu);

How do we know slave supports this MTU? same applies to
failover_change_mtu.




> +	if (err) {
> +		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
> +			   slave_dev->name, failover_dev->mtu);
> +		goto err_set_mtu;
> +	}
> +
> +	finfo = netdev_priv(failover_dev);
> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
> +
> +	dev_hold(slave_dev);
> +
> +	if (standby) {
> +		rcu_assign_pointer(finfo->standby_dev, slave_dev);
> +		dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
> +	} else {
> +		rcu_assign_pointer(finfo->primary_dev, slave_dev);
> +		dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
> +		failover_dev->min_mtu = slave_dev->min_mtu;
> +		failover_dev->max_mtu = slave_dev->max_mtu;
> +	}
> +
> +	netdev_info(failover_dev, "failover slave:%s joined\n",
> +		    slave_dev->name);
> +
> +	return 0;
> +
> +err_set_mtu:
> +	dev_close(slave_dev);
> +err_dev_open:
> +	return err;
> +}
> +
> +/* Called when slave dev is injecting data into network stack.
> + * Change the associated network device from lower dev to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +
> +	skb->dev = ndev;
> +
> +	return RX_HANDLER_ANOTHER;
> +}
> +
> +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
> +{
> +	struct net_device *failover_dev;
> +	struct failover *failover;
> +
> +	spin_lock(&failover_lock);
> +	list_for_each_entry(failover, &failover_list, list) {
> +		failover_dev = rtnl_dereference(failover->failover_dev);
> +		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> +			*ops = rtnl_dereference(failover->ops);
> +			spin_unlock(&failover_lock);
> +			return failover_dev;
> +		}
> +	}
> +	spin_unlock(&failover_lock);
> +	return NULL;
> +}
> +
> +static int failover_slave_register(struct net_device *slave_dev)
> +{
> +	struct failover_ops *failover_ops;
> +	struct net_device *failover_dev;
> +	int ret;
> +
> +	ASSERT_RTNL();
> +
> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
> +	if (!failover_dev)
> +		goto done;
> +
> +	ret = failover_slave_pre_register(slave_dev, failover_dev,
> +					  failover_ops);
> +	if (ret)
> +		goto done;
> +
> +	ret = netdev_rx_handler_register(slave_dev, failover_ops ?
> +					 failover_ops->handle_frame :
> +					 failover_handle_frame, failover_dev);
> +	if (ret) {
> +		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
> +			   ret);
> +		goto done;
> +	}
> +
> +	ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
> +	if (ret) {
> +		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
> +			   failover_dev->name, ret);
> +		goto upper_link_failed;
> +	}
> +
> +	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
> +
> +	ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
> +	if (ret)
> +		goto err_join;
> +
> +	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
> +
> +	netdev_info(failover_dev, "failover slave:%s registered\n",
> +		    slave_dev->name);
> +
> +	goto done;
> +
> +err_join:
> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> +upper_link_failed:
> +	netdev_rx_handler_unregister(slave_dev);
> +done:
> +	return NOTIFY_DONE;
> +}
> +
> +static int failover_slave_pre_unregister(struct net_device *slave_dev,
> +					 struct net_device *failover_dev,
> +					 struct failover_ops *failover_ops)
> +{
> +	struct net_device *standby_dev, *primary_dev;
> +	struct failover_info *finfo;
> +
> +	if (failover_ops) {
> +		if (!failover_ops->slave_pre_unregister)
> +			return -EINVAL;
> +
> +		return failover_ops->slave_pre_unregister(slave_dev,
> +							  failover_dev);
> +	}
> +
> +	finfo = netdev_priv(failover_dev);
> +	primary_dev = rtnl_dereference(finfo->primary_dev);
> +	standby_dev = rtnl_dereference(finfo->standby_dev);
> +
> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int failover_slave_release(struct net_device *slave_dev,
> +				  struct net_device *failover_dev,
> +				  struct failover_ops *failover_ops)
> +{
> +	struct net_device *standby_dev, *primary_dev;
> +	struct failover_info *finfo;
> +
> +	if (failover_ops) {
> +		if (!failover_ops->slave_release)
> +			return -EINVAL;
> +
> +		return failover_ops->slave_release(slave_dev, failover_dev);
> +	}
> +
> +	finfo = netdev_priv(failover_dev);
> +	primary_dev = rtnl_dereference(finfo->primary_dev);
> +	standby_dev = rtnl_dereference(finfo->standby_dev);
> +
> +	if (slave_dev == standby_dev) {
> +		RCU_INIT_POINTER(finfo->standby_dev, NULL);
> +	} else {
> +		RCU_INIT_POINTER(finfo->primary_dev, NULL);
> +		if (standby_dev) {
> +			failover_dev->min_mtu = standby_dev->min_mtu;
> +			failover_dev->max_mtu = standby_dev->max_mtu;
> +		}
> +	}
> +
> +	dev_put(slave_dev);
> +
> +	netdev_info(failover_dev, "failover slave:%s released\n",
> +		    slave_dev->name);
> +
> +	return 0;
> +}
> +
> +int failover_slave_unregister(struct net_device *slave_dev)
> +{
> +	struct failover_ops *failover_ops;
> +	struct net_device *failover_dev;
> +	int ret;
> +
> +	if (!netif_is_failover_slave(slave_dev))
> +		goto done;
> +
> +	ASSERT_RTNL();
> +
> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
> +	if (!failover_dev)
> +		goto done;
> +
> +	ret = failover_slave_pre_unregister(slave_dev, failover_dev,
> +					    failover_ops);
> +	if (ret)
> +		goto done;
> +
> +	netdev_rx_handler_unregister(slave_dev);
> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> +
> +	failover_slave_release(slave_dev, failover_dev, failover_ops);


Don't you need to get stats from it? This device is going away ...

> +
> +	netdev_info(failover_dev, "failover slave:%s unregistered\n",
> +		    slave_dev->name);
> +
> +done:
> +	return NOTIFY_DONE;
> +}
> +EXPORT_SYMBOL_GPL(failover_slave_unregister);
> +
> +static bool failover_xmit_ready(struct net_device *dev)
> +{
> +	return netif_running(dev) && netif_carrier_ok(dev);
> +}
> +
> +static int failover_slave_link_change(struct net_device *slave_dev)
> +{
> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
> +	struct failover_ops *failover_ops;
> +	struct failover_info *finfo;
> +
> +	if (!netif_is_failover_slave(slave_dev))
> +		goto done;
> +
> +	ASSERT_RTNL();
> +
> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
> +	if (!failover_dev)
> +		goto done;
> +
> +	if (failover_ops) {
> +		if (!failover_ops->slave_link_change)
> +			goto done;
> +
> +		return failover_ops->slave_link_change(slave_dev, failover_dev);
> +	}
> +
> +	if (!netif_running(failover_dev))
> +		return 0;
> +
> +	finfo = netdev_priv(failover_dev);
> +
> +	primary_dev = rtnl_dereference(finfo->primary_dev);
> +	standby_dev = rtnl_dereference(finfo->standby_dev);
> +
> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> +		goto done;
> +
> +	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
> +	    (standby_dev && failover_xmit_ready(standby_dev))) {
> +		netif_carrier_on(failover_dev);
> +		netif_tx_wake_all_queues(failover_dev);
> +	} else {
> +		netif_carrier_off(failover_dev);
> +		netif_tx_stop_all_queues(failover_dev);

And I think it's a good idea to get stats from device here too.


> +	}
> +
> +done:
> +	return NOTIFY_DONE;
> +}
> +
> +static bool failover_validate_event_dev(struct net_device *dev)
> +{
> +	/* Skip parent events */
> +	if (netif_is_failover(dev))
> +		return false;
> +
> +	/* Avoid non-Ethernet type devices */

... for now. It would be possible easy to make this generic -
just copy things like type and addr_len from slave.

> +	if (dev->type != ARPHRD_ETHER)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int
> +failover_event(struct notifier_block *this, unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	if (!failover_validate_event_dev(event_dev))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return failover_slave_register(event_dev);
> +	case NETDEV_UNREGISTER:
> +		return failover_slave_unregister(event_dev);
> +	case NETDEV_UP:
> +	case NETDEV_DOWN:
> +	case NETDEV_CHANGE:
> +		return failover_slave_link_change(event_dev);
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block failover_notifier = {
> +	.notifier_call = failover_event,
> +};
> +
> +static int failover_open(struct net_device *dev)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *primary_dev, *standby_dev;
> +	int err;
> +
> +	netif_carrier_off(dev);
> +	netif_tx_wake_all_queues(dev);
> +
> +	primary_dev = rtnl_dereference(finfo->primary_dev);
> +	if (primary_dev) {
> +		err = dev_open(primary_dev);
> +		if (err)
> +			goto err_primary_open;
> +	}
> +
> +	standby_dev = rtnl_dereference(finfo->standby_dev);
> +	if (standby_dev) {
> +		err = dev_open(standby_dev);
> +		if (err)
> +			goto err_standby_open;
> +	}
> +
> +	return 0;
> +
> +err_standby_open:
> +	dev_close(primary_dev);
> +err_primary_open:
> +	netif_tx_disable(dev);
> +	return err;
> +}
> +
> +static int failover_close(struct net_device *dev)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *slave_dev;
> +
> +	netif_tx_disable(dev);
> +
> +	slave_dev = rtnl_dereference(finfo->primary_dev);
> +	if (slave_dev)
> +		dev_close(slave_dev);
> +
> +	slave_dev = rtnl_dereference(finfo->standby_dev);
> +	if (slave_dev)
> +		dev_close(slave_dev);
> +
> +	return 0;
> +}
> +
> +static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
> +				      struct net_device *dev)
> +{
> +	atomic_long_inc(&dev->tx_dropped);
> +	dev_kfree_skb_any(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
> +				       struct net_device *dev)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *xmit_dev;
> +
> +	/* Try xmit via primary netdev followed by standby netdev */
> +	xmit_dev = rcu_dereference_bh(finfo->primary_dev);
> +	if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
> +		xmit_dev = rcu_dereference_bh(finfo->standby_dev);
> +		if (!xmit_dev || !failover_xmit_ready(xmit_dev))
> +			return failover_drop_xmit(skb, dev);
> +	}
> +
> +	skb->dev = xmit_dev;
> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +	return dev_queue_xmit(skb);
> +}

Is this going through qdisc twice? Won't this hurt performance
measureably?

> +
> +static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
> +				 void *accel_priv,
> +				 select_queue_fallback_t fallback)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *primary_dev;
> +	u16 txq;
> +
> +	rcu_read_lock();
> +	primary_dev = rcu_dereference(finfo->primary_dev);
> +	if (primary_dev) {
> +		const struct net_device_ops *ops = primary_dev->netdev_ops;
> +
> +		if (ops->ndo_select_queue)
> +			txq = ops->ndo_select_queue(primary_dev, skb,
> +						    accel_priv, fallback);
> +		else
> +			txq = fallback(primary_dev, skb);
> +
> +		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +		return txq;
> +	}
> +
> +	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> +
> +	/* Save the original txq to restore before passing to the driver */
> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
> +		do {
> +			txq -= dev->real_num_tx_queues;
> +		} while (txq >= dev->real_num_tx_queues);
> +	}
> +
> +	return txq;
> +}
> +
> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
> + * that some drivers can provide 32finfot values only.
> + */
> +static void failover_fold_stats(struct rtnl_link_stats64 *_res,
> +				const struct rtnl_link_stats64 *_new,
> +				const struct rtnl_link_stats64 *_old)
> +{
> +	const u64 *new = (const u64 *)_new;
> +	const u64 *old = (const u64 *)_old;
> +	u64 *res = (u64 *)_res;
> +	int i;
> +
> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
> +		u64 nv = new[i];
> +		u64 ov = old[i];
> +		s64 delta = nv - ov;
> +
> +		/* detects if this particular field is 32bit only */
> +		if (((nv | ov) >> 32) == 0)
> +			delta = (s64)(s32)((u32)nv - (u32)ov);
> +
> +		/* filter anomalies, some drivers reset their stats
> +		 * at down/up events.
> +		 */
> +		if (delta > 0)
> +			res[i] += delta;
> +	}
> +}
> +
> +static void failover_get_stats(struct net_device *dev,
> +			       struct rtnl_link_stats64 *stats)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	const struct rtnl_link_stats64 *new;
> +	struct rtnl_link_stats64 temp;
> +	struct net_device *slave_dev;
> +
> +	spin_lock(&finfo->stats_lock);
> +	memcpy(stats, &finfo->failover_stats, sizeof(*stats));
> +
> +	rcu_read_lock();
> +
> +	slave_dev = rcu_dereference(finfo->primary_dev);
> +	if (slave_dev) {
> +		new = dev_get_stats(slave_dev, &temp);
> +		failover_fold_stats(stats, new, &finfo->primary_stats);
> +		memcpy(&finfo->primary_stats, new, sizeof(*new));
> +	}
> +
> +	slave_dev = rcu_dereference(finfo->standby_dev);
> +	if (slave_dev) {
> +		new = dev_get_stats(slave_dev, &temp);
> +		failover_fold_stats(stats, new, &finfo->standby_stats);
> +		memcpy(&finfo->standby_stats, new, sizeof(*new));
> +	}
> +
> +	rcu_read_unlock();
> +
> +	memcpy(&finfo->failover_stats, stats, sizeof(*stats));
> +	spin_unlock(&finfo->stats_lock);
> +}
> +
> +static int failover_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *primary_dev, *standby_dev;
> +	int ret = 0;
> +
> +	primary_dev = rcu_dereference(finfo->primary_dev);
> +	if (primary_dev) {
> +		ret = dev_set_mtu(primary_dev, new_mtu);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	standby_dev = rcu_dereference(finfo->standby_dev);
> +	if (standby_dev) {
> +		ret = dev_set_mtu(standby_dev, new_mtu);
> +		if (ret) {
> +			dev_set_mtu(primary_dev, dev->mtu);
> +			return ret;
> +		}
> +	}
> +
> +	dev->mtu = new_mtu;
> +	return 0;
> +}
> +
> +static void failover_set_rx_mode(struct net_device *dev)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *slave_dev;
> +
> +	rcu_read_lock();
> +
> +	slave_dev = rcu_dereference(finfo->primary_dev);
> +	if (slave_dev) {
> +		dev_uc_sync_multiple(slave_dev, dev);
> +		dev_mc_sync_multiple(slave_dev, dev);
> +	}
> +
> +	slave_dev = rcu_dereference(finfo->standby_dev);
> +	if (slave_dev) {
> +		dev_uc_sync_multiple(slave_dev, dev);
> +		dev_mc_sync_multiple(slave_dev, dev);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +static const struct net_device_ops failover_dev_ops = {
> +	.ndo_open		= failover_open,
> +	.ndo_stop		= failover_close,
> +	.ndo_start_xmit		= failover_start_xmit,
> +	.ndo_select_queue	= failover_select_queue,
> +	.ndo_get_stats64	= failover_get_stats,
> +	.ndo_change_mtu		= failover_change_mtu,
> +	.ndo_set_rx_mode	= failover_set_rx_mode,
> +	.ndo_validate_addr	= eth_validate_addr,
> +	.ndo_features_check	= passthru_features_check,

xdp support?

> +};
> +
> +#define FAILOVER_NAME "failover"
> +#define FAILOVER_VERSION "0.1"
> +
> +static void failover_ethtool_get_drvinfo(struct net_device *dev,
> +					 struct ethtool_drvinfo *drvinfo)
> +{
> +	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
> +	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
> +}
> +
> +int failover_ethtool_get_link_ksettings(struct net_device *dev,
> +					struct ethtool_link_ksettings *cmd)
> +{
> +	struct failover_info *finfo = netdev_priv(dev);
> +	struct net_device *slave_dev;
> +
> +	slave_dev = rtnl_dereference(finfo->primary_dev);
> +	if (!slave_dev || !failover_xmit_ready(slave_dev)) {
> +		slave_dev = rtnl_dereference(finfo->standby_dev);
> +		if (!slave_dev || !failover_xmit_ready(slave_dev)) {
> +			cmd->base.duplex = DUPLEX_UNKNOWN;
> +			cmd->base.port = PORT_OTHER;
> +			cmd->base.speed = SPEED_UNKNOWN;
> +
> +			return 0;
> +		}
> +	}
> +
> +	return __ethtool_get_link_ksettings(slave_dev, cmd);
> +}
> +EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
> +
> +static const struct ethtool_ops failover_ethtool_ops = {
> +	.get_drvinfo            = failover_ethtool_get_drvinfo,
> +	.get_link               = ethtool_op_get_link,
> +	.get_link_ksettings     = failover_ethtool_get_link_ksettings,
> +};
> +
> +static void failover_register_existing_slave(struct net_device *failover_dev)
> +{
> +	struct net *net = dev_net(failover_dev);
> +	struct net_device *dev;
> +
> +	rtnl_lock();
> +	for_each_netdev(net, dev) {
> +		if (dev == failover_dev)
> +			continue;
> +		if (!failover_validate_event_dev(dev))
> +			continue;
> +		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
> +			failover_slave_register(dev);
> +	}
> +	rtnl_unlock();
> +}
> +
> +int failover_register(struct net_device *dev, struct failover_ops *ops,
> +		      struct failover **pfailover)
> +{
> +	struct failover *failover;
> +
> +	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
> +	if (!failover)
> +		return -ENOMEM;
> +
> +	rcu_assign_pointer(failover->ops, ops);
> +	dev_hold(dev);
> +	dev->priv_flags |= IFF_FAILOVER;
> +	rcu_assign_pointer(failover->failover_dev, dev);
> +
> +	spin_lock(&failover_lock);
> +	list_add_tail(&failover->list, &failover_list);
> +	spin_unlock(&failover_lock);
> +
> +	failover_register_existing_slave(dev);
> +
> +	*pfailover = failover;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(failover_register);
> +
> +void failover_unregister(struct failover *failover)
> +{
> +	struct net_device *failover_dev;
> +
> +	failover_dev = rcu_dereference(failover->failover_dev);
> +
> +	failover_dev->priv_flags &= ~IFF_FAILOVER;
> +	dev_put(failover_dev);
> +
> +	spin_lock(&failover_lock);
> +	list_del(&failover->list);
> +	spin_unlock(&failover_lock);
> +
> +	kfree(failover);
> +}
> +EXPORT_SYMBOL_GPL(failover_unregister);
> +
> +int failover_create(struct net_device *standby_dev, struct failover **pfailover)
> +{
> +	struct device *dev = standby_dev->dev.parent;
> +	struct net_device *failover_dev;
> +	int err;
> +
> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
> +	 * that most devices being bonded won't have too many queues.
> +	 */
> +	failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
> +	if (!failover_dev) {
> +		dev_err(dev, "Unable to allocate failover_netdev!\n");
> +		return -ENOMEM;
> +	}
> +
> +	dev_net_set(failover_dev, dev_net(standby_dev));
> +	SET_NETDEV_DEV(failover_dev, dev);
> +
> +	failover_dev->netdev_ops = &failover_dev_ops;
> +	failover_dev->ethtool_ops = &failover_ethtool_ops;
> +
> +	/* Initialize the device options */
> +	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
> +	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
> +				       IFF_TX_SKB_SHARING);
> +
> +	/* don't acquire failover netdev's netif_tx_lock when transmitting */
> +	failover_dev->features |= NETIF_F_LLTX;
> +
> +	/* Don't allow failover devices to change network namespaces. */
> +	failover_dev->features |= NETIF_F_NETNS_LOCAL;
> +
> +	failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
> +				    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
> +				    NETIF_F_HIGHDMA | NETIF_F_LRO;

OK but then you must make sure your primary and standby both
support these features.

> +
> +	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
> +	failover_dev->features |= failover_dev->hw_features;
> +
> +	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
> +	       failover_dev->addr_len);
> +
> +	failover_dev->min_mtu = standby_dev->min_mtu;
> +	failover_dev->max_mtu = standby_dev->max_mtu;

OK MTU is copied, fine. But is this always enough?

How about e.g. hard_header_len? min_header_len? needed_headroom?
needed_tailroom? I'd worry that even if you cover existing ones more
might be added with time.  A function copying config between devices
probably belongs in some central place IMHO.



> +
> +	err = register_netdev(failover_dev);
> +	if (err < 0) {
> +		dev_err(dev, "Unable to register failover_dev!\n");
> +		goto err_register_netdev;
> +	}
> +
> +	netif_carrier_off(failover_dev);
> +
> +	err = failover_register(failover_dev, NULL, pfailover);
> +	if (err < 0)
> +		goto err_failover;
> +
> +	return 0;
> +
> +err_failover:
> +	unregister_netdev(failover_dev);
> +err_register_netdev:
> +	free_netdev(failover_dev);
> +
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(failover_create);
> +
> +void failover_destroy(struct failover *failover)
> +{
> +	struct net_device *failover_dev;
> +	struct net_device *slave_dev;
> +	struct failover_info *finfo;
> +
> +	if (!failover)
> +		return;
> +
> +	failover_dev = rcu_dereference(failover->failover_dev);
> +	finfo = netdev_priv(failover_dev);
> +
> +	netif_device_detach(failover_dev);
> +
> +	rtnl_lock();
> +
> +	slave_dev = rtnl_dereference(finfo->primary_dev);
> +	if (slave_dev)
> +		failover_slave_unregister(slave_dev);
> +
> +	slave_dev = rtnl_dereference(finfo->standby_dev);
> +	if (slave_dev)
> +		failover_slave_unregister(slave_dev);
> +
> +	failover_unregister(failover);
> +
> +	unregister_netdevice(failover_dev);
> +
> +	rtnl_unlock();
> +
> +	free_netdev(failover_dev);
> +}
> +EXPORT_SYMBOL_GPL(failover_destroy);
> +
> +static __init int
> +failover_init(void)
> +{
> +	register_netdevice_notifier(&failover_notifier);
> +
> +	return 0;
> +}
> +module_init(failover_init);
> +
> +static __exit
> +void failover_exit(void)
> +{
> +	unregister_netdevice_notifier(&failover_notifier);
> +}
> +module_exit(failover_exit);
> +
> +MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
> +MODULE_LICENSE("GPL v2");
> -- 
> 2.14.3

^ permalink raw reply

* Re: [PATCH v7 net-next 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-04-20  2:46 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <1524188524-28411-4-git-send-email-sridhar.samudrala@intel.com>

On Thu, Apr 19, 2018 at 06:42:03PM -0700, Sridhar Samudrala wrote:
> @@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>  	if (ret)
>  		goto err_virtio;
> +
>  	return 0;
>  err_virtio:
>  	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);

I'd rather you didn't let's not mix in whitespace changes.

> -- 
> 2.14.3

^ permalink raw reply

* Re: [PATCH] virtio_ring: switch to dma_XX barriers for rpmsg
From: Jason Wang @ 2018-04-20  3:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Ohad Ben-Cohen, Paolo Bonzini, virtualization, linux-remoteproc,
	Bjorn Andersson
In-Reply-To: <1524159181-351878-1-git-send-email-mst@redhat.com>



On 2018年04月20日 01:35, Michael S. Tsirkin wrote:
> virtio is using barriers to order memory accesses, thus
> dma_wmb/rmb is a good match.
>
> Build-tested on x86: Before
>
> [mst@tuck linux]$ size drivers/virtio/virtio_ring.o
>     text    data     bss     dec     hex filename
>    11392     820       0   12212    2fb4 drivers/virtio/virtio_ring.o
>
> After
> mst@tuck linux]$ size drivers/virtio/virtio_ring.o
>     text    data     bss     dec     hex filename
>    11284     820       0   12104    2f48 drivers/virtio/virtio_ring.o
>
> Cc: Ohad Ben-Cohen <ohad@wizery.com>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: linux-remoteproc@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> It's good in theory, but could one of RPMSG maintainers please review
> and ack this patch? Or even better test it?
>
> All these barriers are useless on Intel anyway ...
>
>   include/linux/virtio_ring.h | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index bbf3252..fab0213 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -35,7 +35,7 @@ static inline void virtio_rmb(bool weak_barriers)
>   	if (weak_barriers)
>   		virt_rmb();
>   	else
> -		rmb();
> +		dma_rmb();
>   }
>   
>   static inline void virtio_wmb(bool weak_barriers)
> @@ -43,7 +43,7 @@ static inline void virtio_wmb(bool weak_barriers)
>   	if (weak_barriers)
>   		virt_wmb();
>   	else
> -		wmb();
> +		dma_wmb();
>   }
>   
>   static inline void virtio_store_mb(bool weak_barriers,

Acked-by: Jason Wang <jasowang@redhat.com>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Tiwei Bie @ 2018-04-20  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kvm, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180419212911-mutt-send-email-mst@kernel.org>

On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > One problem is that, different virtio ring compatible devices
> > > > > may have different device interfaces. That is to say, we will
> > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > that's what this patch trying to fix. The idea behind this
> > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > in kernel.
> > > > So you just move the abstraction layer from qemu to kernel, and you still
> > > > need different drivers in kernel for different device interfaces of
> > > > accelerators. This looks even more complex than leaving it in qemu. As you
> > > > said, another idea is to implement userspace vhost backend for accelerators
> > > > which seems easier and could co-work with other parts of qemu without
> > > > inventing new type of messages.
> > > I'm not quite sure. Do you think it's acceptable to
> > > add various vendor specific hardware drivers in QEMU?
> > > 
> > 
> > I don't object but we need to figure out the advantages of doing it in qemu
> > too.
> > 
> > Thanks
> 
> To be frank kernel is exactly where device drivers belong.  DPDK did
> move them to userspace but that's merely a requirement for data path.
> *If* you can have them in kernel that is best:
> - update kernel and there's no need to rebuild userspace
> - apps can be written in any language no need to maintain multiple
>   libraries or add wrappers
> - security concerns are much smaller (ok people are trying to
>   raise the bar with IOMMUs and such, but it's already pretty
>   good even without)
> 
> The biggest issue is that you let userspace poke at the
> device which is also allowed by the IOMMU to poke at
> kernel memory (needed for kernel driver to work).

I think the device won't and shouldn't be allowed to
poke at kernel memory. Its kernel driver needs some
kernel memory to work. But the device doesn't have
the access to them. Instead, the device only has the
access to:

(1) the entire memory of the VM (if vIOMMU isn't used)
or
(2) the memory belongs to the guest virtio device (if
    vIOMMU is being used).

Below is the reason:

For the first case, we should program the IOMMU for
the hardware device based on the info in the memory
table which is the entire memory of the VM.

For the second case, we should program the IOMMU for
the hardware device based on the info in the shadow
page table of the vIOMMU.

So the memory can be accessed by the device is limited,
it should be safe especially for the second case.

My concern is that, in this RFC, we don't program the
IOMMU for the mdev device in the userspace via the VFIO
API directly. Instead, we pass the memory table to the
kernel driver via the mdev device (BAR0) and ask the
driver to do the IOMMU programming. Someone may don't
like it. The main reason why we don't program IOMMU via
VFIO API in userspace directly is that, currently IOMMU
drivers don't support mdev bus.

> 
> Yes, maybe if device is not buggy it's all fine, but
> it's better if we do not have to trust the device
> otherwise the security picture becomes more murky.
> 
> I suggested attaching a PASID to (some) queues - see my old post "using
> PASIDs to enable a safe variant of direct ring access".

It's pretty cool. We also have some similar ideas.
Cunming will talk more about this.

Best regards,
Tiwei Bie

> 
> Then using IOMMU with VFIO to limit access through queue to corrent
> ranges of memory.
> 
> 
> -- 
> MST

^ permalink raw reply

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20  3:34 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <1524188524-28411-3-git-send-email-sridhar.samudrala@intel.com>

On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
> +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
> +{
> +	struct net_device *failover_dev;
> +	struct failover *failover;
> +
> +	spin_lock(&failover_lock);
> +	list_for_each_entry(failover, &failover_list, list) {
> +		failover_dev = rtnl_dereference(failover->failover_dev);
> +		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> +			*ops = rtnl_dereference(failover->ops);
> +			spin_unlock(&failover_lock);
> +			return failover_dev;
> +		}
> +	}
> +	spin_unlock(&failover_lock);
> +	return NULL;
> +}

So it bothers me that this ties to just any device at all.

I thought hard about it, and I see a nice solution.
Here goes:

QEMU has a pci-bridge-seat device. It is used currently to
group e.g. input devices for multiseat support.

Now if you squint at it hard enough, you can see we are also
in a grouping problem. So how about the following:

1. allocate a virtio pci bridge device failover grouping id (reserve through virtio TC).
2. only group two devices in a failover configuration if both
   are under the same bridge with this id (in addition to a mac check).

In particular a bridged configuration makes it easier to
make sure the standby is enumerated before the primary.
In fact we could fail init of failover if we see
appear standby *after* primary.

And this allows many devices with the same ethernet address
without any issues - just under separate bridges.

Further if we ever want to enumerate in parallel this can
be supported by adding a driver for the bridge.

In fact, I see how down the road such a device could
be the beginning of the more ambitious plan to
expose a powerful switchdev interface for
more advanced 


So far so good, but I see a couple of issues:

- it is PCI specific
	 Not a big deal: we limit ourselves to PCI anyway ATM.

- does not work across PCI domains - which are helpful for NUMA
  (e.g. we want to be able to move to primary
   which is on a different numa
   node without losing connectivity).

	Idea: add a "group ID" register to each of these pci bridge
	devices (e.g. in device specific config space).
	Match two bridges if they have the same group ID.

- all these extra bridges slow enumeration down somewhat

	Idea: as a fallback if no bridge is found,
	just assume all devices match, which will result
	in falling back on the "match by mac" logic like in
	this patchset. Will be fine for simple setups.
	

Thoughts?

-- 
MST

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-20  3:50 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: alexander.h.duyck, virtio-dev, kvm, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180420032806.i3jy7xb7emgil6eu@debian>

On Fri, Apr 20, 2018 at 11:28:07AM +0800, Tiwei Bie wrote:
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and you still
> > > > > need different drivers in kernel for different device interfaces of
> > > > > accelerators. This looks even more complex than leaving it in qemu. As you
> > > > > said, another idea is to implement userspace vhost backend for accelerators
> > > > > which seems easier and could co-work with other parts of qemu without
> > > > > inventing new type of messages.
> > > > I'm not quite sure. Do you think it's acceptable to
> > > > add various vendor specific hardware drivers in QEMU?
> > > > 
> > > 
> > > I don't object but we need to figure out the advantages of doing it in qemu
> > > too.
> > > 
> > > Thanks
> > 
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> > 
> > The biggest issue is that you let userspace poke at the
> > device which is also allowed by the IOMMU to poke at
> > kernel memory (needed for kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to
> poke at kernel memory. Its kernel driver needs some
> kernel memory to work. But the device doesn't have
> the access to them. Instead, the device only has the
> access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used)
> or
> (2) the memory belongs to the guest virtio device (if
>     vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for
> the hardware device based on the info in the memory
> table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for
> the hardware device based on the info in the shadow
> page table of the vIOMMU.
> 
> So the memory can be accessed by the device is limited,
> it should be safe especially for the second case.
> 
> My concern is that, in this RFC, we don't program the
> IOMMU for the mdev device in the userspace via the VFIO
> API directly. Instead, we pass the memory table to the
> kernel driver via the mdev device (BAR0) and ask the
> driver to do the IOMMU programming. Someone may don't
> like it. The main reason why we don't program IOMMU via
> VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.

But it is a pci device after all, isn't it?
IOMMU drivers certainly support that ...

Another issue with this approach is that internal
kernel issues leak out to the interface.

> > 
> > Yes, maybe if device is not buggy it's all fine, but
> > it's better if we do not have to trust the device
> > otherwise the security picture becomes more murky.
> > 
> > I suggested attaching a PASID to (some) queues - see my old post "using
> > PASIDs to enable a safe variant of direct ring access".
> 
> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie

An extra benefit to this could be that requests with PASID
undergo an extra level of translation.
We could use it to avoid the need for shadowing on intel.



Something like this:
- expose to guest a standard virtio device (no pasid support)
- back it by virtio device with pasid support on the host
  by attaching same pasid to all queues

now - guest will build 1 level of page tables

we build first level page tables for requests with pasid
and point the IOMMU to use the guest supplied page tables
for the second level of translation.

Now we do need to forward invalidations but we no
longer need to set the CM bit and shadow valid entries.



> > 
> > Then using IOMMU with VFIO to limit access through queue to corrent
> > ranges of memory.
> > 
> > 
> > -- 
> > MST

^ permalink raw reply

* RE: [RFC] vhost: introduce mdev based hardware vhost backend
From: Liang, Cunming @ 2018-04-20  3:50 UTC (permalink / raw)
  To: Bie, Tiwei, Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Wang, Xiao W,
	ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong
In-Reply-To: <20180420032806.i3jy7xb7emgil6eu@debian>



> -----Original Message-----
> From: Bie, Tiwei
> Sent: Friday, April 20, 2018 11:28 AM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>; alex.williamson@redhat.com;
> ddutile@redhat.com; Duyck, Alexander H <alexander.h.duyck@intel.com>;
> virtio-dev@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> kvm@vger.kernel.org; virtualization@lists.linux-foundation.org;
> netdev@vger.kernel.org; Daly, Dan <dan.daly@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Tan,
> Jianfeng <jianfeng.tan@intel.com>; Wang, Xiao W <xiao.w.wang@intel.com>;
> Tian, Kevin <kevin.tian@intel.com>
> Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
> 
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and
> > > > > you still need different drivers in kernel for different device
> > > > > interfaces of accelerators. This looks even more complex than
> > > > > leaving it in qemu. As you said, another idea is to implement
> > > > > userspace vhost backend for accelerators which seems easier and
> > > > > could co-work with other parts of qemu without inventing new type of
> messages.
> > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > vendor specific hardware drivers in QEMU?
> > > >
> > >
> > > I don't object but we need to figure out the advantages of doing it
> > > in qemu too.
> > >
> > > Thanks
> >
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> >
> > The biggest issue is that you let userspace poke at the device which
> > is also allowed by the IOMMU to poke at kernel memory (needed for
> > kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to poke at kernel memory. Its
> kernel driver needs some kernel memory to work. But the device doesn't have
> the access to them. Instead, the device only has the access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used) or
> (2) the memory belongs to the guest virtio device (if
>     vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for the hardware device based
> on the info in the memory table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for the hardware device
> based on the info in the shadow page table of the vIOMMU.
> 
> So the memory can be accessed by the device is limited, it should be safe
> especially for the second case.
> 
> My concern is that, in this RFC, we don't program the IOMMU for the mdev
> device in the userspace via the VFIO API directly. Instead, we pass the memory
> table to the kernel driver via the mdev device (BAR0) and ask the driver to do the
> IOMMU programming. Someone may don't like it. The main reason why we don't
> program IOMMU via VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.
> 
> >
> > Yes, maybe if device is not buggy it's all fine, but it's better if we
> > do not have to trust the device otherwise the security picture becomes
> > more murky.
> >
> > I suggested attaching a PASID to (some) queues - see my old post
> > "using PASIDs to enable a safe variant of direct ring access".
> 
Ideally we can have a device binding with normal driver in host, meanwhile support to allocate a few queues attaching with PASID on-demand. By vhost mdev transport channel, the data path ability of queues(as a device) can expose to qemu vhost adaptor as a vDPA instance. Then we can avoid VF number limitation, providing vhost data path acceleration in a small granularity.

> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie
> 
> >
> > Then using IOMMU with VFIO to limit access through queue to corrent
> > ranges of memory.
> >
> >
> > --
> > MST

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-20  3:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kvm, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180419212911-mutt-send-email-mst@kernel.org>



On 2018年04月20日 02:40, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
>>>>> One problem is that, different virtio ring compatible devices
>>>>> may have different device interfaces. That is to say, we will
>>>>> need different drivers in QEMU. It could be troublesome. And
>>>>> that's what this patch trying to fix. The idea behind this
>>>>> patch is very simple: mdev is a standard way to emulate device
>>>>> in kernel.
>>>> So you just move the abstraction layer from qemu to kernel, and you still
>>>> need different drivers in kernel for different device interfaces of
>>>> accelerators. This looks even more complex than leaving it in qemu. As you
>>>> said, another idea is to implement userspace vhost backend for accelerators
>>>> which seems easier and could co-work with other parts of qemu without
>>>> inventing new type of messages.
>>> I'm not quite sure. Do you think it's acceptable to
>>> add various vendor specific hardware drivers in QEMU?
>>>
>> I don't object but we need to figure out the advantages of doing it in qemu
>> too.
>>
>> Thanks
> To be frank kernel is exactly where device drivers belong.  DPDK did
> move them to userspace but that's merely a requirement for data path.
> *If* you can have them in kernel that is best:
> - update kernel and there's no need to rebuild userspace

Well, you still need to rebuild userspace since a new vhost backend is 
required which relies vhost protocol through mdev API. And I believe 
upgrading userspace package is considered to be more lightweight than 
upgrading kernel. With mdev, we're likely to repeat the story of vhost 
API, dealing with features/versions and inventing new API endless for 
new features. And you will still need to rebuild the userspace.

> - apps can be written in any language no need to maintain multiple
>    libraries or add wrappers

This is not a big issue consider It's not a generic network driver but a 
mdev driver, the only possible user is VM.

> - security concerns are much smaller (ok people are trying to
>    raise the bar with IOMMUs and such, but it's already pretty
>    good even without)

Well, I think not, kernel bugs are much more serious than userspace 
ones. And I beg the kernel driver itself won't be small.

>
> The biggest issue is that you let userspace poke at the
> device which is also allowed by the IOMMU to poke at
> kernel memory (needed for kernel driver to work).

I don't quite get. The userspace driver could be built on top of VFIO 
for sure. So kernel memory were perfectly isolated in this case.

>
> Yes, maybe if device is not buggy it's all fine, but
> it's better if we do not have to trust the device
> otherwise the security picture becomes more murky.
>
> I suggested attaching a PASID to (some) queues - see my old post "using
> PASIDs to enable a safe variant of direct ring access".
>
> Then using IOMMU with VFIO to limit access through queue to corrent
> ranges of memory.

Well userspace driver could benefit from this too. And we can even go 
further by using nested IO page tables to share IOVA address space 
between devices and a VM.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: drm: qxl: what's going on
From: Gerd Hoffmann @ 2018-04-20  6:35 UTC (permalink / raw)
  To: Chunyu Hu; +Cc: airlied, virtualization
In-Reply-To: <870348000.18567429.1524205571170.JavaMail.zimbra@redhat.com>

On Fri, Apr 20, 2018 at 02:26:11AM -0400, Chunyu Hu wrote:
> Hello,
> 
> I am seeing qxl driver is not working in my kvm machine with kernel 4.17,
> (4.16 failed too). I'd like to consult your expert is there a fix for it
> now? 

https://www.kraxel.org/cgit/linux/log/?h=drm-qxl-release

cheers,
  Gerd

^ permalink raw reply

* [PATCH v2 1/4] qxl: remove qxl_io_log()
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

qxl_io_log() sends messages over to the host (qemu) for logging.
Remove the function and all callers, we can just use standard
DRM_DEBUG calls (and if needed a serial console).

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_drv.h     |  3 ---
 drivers/gpu/drm/qxl/qxl_cmd.c     | 34 ++--------------------------------
 drivers/gpu/drm/qxl/qxl_display.c | 27 ++++-----------------------
 drivers/gpu/drm/qxl/qxl_fb.c      |  2 --
 drivers/gpu/drm/qxl/qxl_irq.c     |  3 +--
 5 files changed, 7 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_drv.h b/drivers/gpu/drm/qxl/qxl_drv.h
index 00a1a66b05..4b89840173 100644
--- a/drivers/gpu/drm/qxl/qxl_drv.h
+++ b/drivers/gpu/drm/qxl/qxl_drv.h
@@ -298,9 +298,6 @@ struct qxl_device {
 	int monitors_config_height;
 };
 
-/* forward declaration for QXL_INFO_IO */
-__printf(2,3) void qxl_io_log(struct qxl_device *qdev, const char *fmt, ...);
-
 extern const struct drm_ioctl_desc qxl_ioctls[];
 extern int qxl_max_ioctl;
 
diff --git a/drivers/gpu/drm/qxl/qxl_cmd.c b/drivers/gpu/drm/qxl/qxl_cmd.c
index c0fb52c6d4..850f8d7d37 100644
--- a/drivers/gpu/drm/qxl/qxl_cmd.c
+++ b/drivers/gpu/drm/qxl/qxl_cmd.c
@@ -341,12 +341,9 @@ int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
 	surface_height = surf->surf.height;
 
 	if (area->left < 0 || area->top < 0 ||
-	    area->right > surface_width || area->bottom > surface_height) {
-		qxl_io_log(qdev, "%s: not doing area update for "
-			   "%d, (%d,%d,%d,%d) (%d,%d)\n", __func__, surface_id, area->left,
-			   area->top, area->right, area->bottom, surface_width, surface_height);
+	    area->right > surface_width || area->bottom > surface_height)
 		return -EINVAL;
-	}
+
 	mutex_lock(&qdev->update_area_mutex);
 	qdev->ram_header->update_area = *area;
 	qdev->ram_header->update_surface = surface_id;
@@ -407,20 +404,6 @@ void qxl_io_memslot_add(struct qxl_device *qdev, uint8_t id)
 	wait_for_io_cmd(qdev, id, QXL_IO_MEMSLOT_ADD_ASYNC);
 }
 
-void qxl_io_log(struct qxl_device *qdev, const char *fmt, ...)
-{
-	va_list args;
-
-	va_start(args, fmt);
-	vsnprintf(qdev->ram_header->log_buf, QXL_LOG_BUF_SIZE, fmt, args);
-	va_end(args);
-	/*
-	 * DO not do a DRM output here - this will call printk, which will
-	 * call back into qxl for rendering (qxl_fb)
-	 */
-	outb(0, qdev->io_base + QXL_IO_LOG);
-}
-
 void qxl_io_reset(struct qxl_device *qdev)
 {
 	outb(0, qdev->io_base + QXL_IO_RESET);
@@ -428,19 +411,6 @@ void qxl_io_reset(struct qxl_device *qdev)
 
 void qxl_io_monitors_config(struct qxl_device *qdev)
 {
-	qxl_io_log(qdev, "%s: %d [%dx%d+%d+%d]\n", __func__,
-		   qdev->monitors_config ?
-		   qdev->monitors_config->count : -1,
-		   qdev->monitors_config && qdev->monitors_config->count ?
-		   qdev->monitors_config->heads[0].width : -1,
-		   qdev->monitors_config && qdev->monitors_config->count ?
-		   qdev->monitors_config->heads[0].height : -1,
-		   qdev->monitors_config && qdev->monitors_config->count ?
-		   qdev->monitors_config->heads[0].x : -1,
-		   qdev->monitors_config && qdev->monitors_config->count ?
-		   qdev->monitors_config->heads[0].y : -1
-		   );
-
 	wait_for_io_cmd(qdev, 0, QXL_IO_MONITORS_CONFIG_ASYNC);
 }
 
diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index ecb35ed0ea..dea757ef62 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -48,12 +48,8 @@ static void qxl_alloc_client_monitors_config(struct qxl_device *qdev, unsigned c
 		qdev->client_monitors_config = kzalloc(
 				sizeof(struct qxl_monitors_config) +
 				sizeof(struct qxl_head) * count, GFP_KERNEL);
-		if (!qdev->client_monitors_config) {
-			qxl_io_log(qdev,
-				   "%s: allocation failure for %u heads\n",
-				   __func__, count);
+		if (!qdev->client_monitors_config)
 			return;
-		}
 	}
 	qdev->client_monitors_config->count = count;
 }
@@ -74,12 +70,8 @@ static int qxl_display_copy_rom_client_monitors_config(struct qxl_device *qdev)
 	num_monitors = qdev->rom->client_monitors_config.count;
 	crc = crc32(0, (const uint8_t *)&qdev->rom->client_monitors_config,
 		  sizeof(qdev->rom->client_monitors_config));
-	if (crc != qdev->rom->client_monitors_config_crc) {
-		qxl_io_log(qdev, "crc mismatch: have %X (%zd) != %X\n", crc,
-			   sizeof(qdev->rom->client_monitors_config),
-			   qdev->rom->client_monitors_config_crc);
+	if (crc != qdev->rom->client_monitors_config_crc)
 		return MONITORS_CONFIG_BAD_CRC;
-	}
 	if (!num_monitors) {
 		DRM_DEBUG_KMS("no client monitors configured\n");
 		return status;
@@ -170,12 +162,10 @@ void qxl_display_read_client_monitors_config(struct qxl_device *qdev)
 		udelay(5);
 	}
 	if (status == MONITORS_CONFIG_BAD_CRC) {
-		qxl_io_log(qdev, "config: bad crc\n");
 		DRM_DEBUG_KMS("ignoring client monitors config: bad crc");
 		return;
 	}
 	if (status == MONITORS_CONFIG_UNCHANGED) {
-		qxl_io_log(qdev, "config: unchanged\n");
 		DRM_DEBUG_KMS("ignoring client monitors config: unchanged");
 		return;
 	}
@@ -385,14 +375,6 @@ static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
 				  const struct drm_display_mode *mode,
 				  struct drm_display_mode *adjusted_mode)
 {
-	struct drm_device *dev = crtc->dev;
-	struct qxl_device *qdev = dev->dev_private;
-
-	qxl_io_log(qdev, "%s: (%d,%d) => (%d,%d)\n",
-		   __func__,
-		   mode->hdisplay, mode->vdisplay,
-		   adjusted_mode->hdisplay,
-		   adjusted_mode->vdisplay);
 	return true;
 }
 
@@ -403,10 +385,9 @@ qxl_send_monitors_config(struct qxl_device *qdev)
 
 	BUG_ON(!qdev->ram_header->monitors_config);
 
-	if (qdev->monitors_config->count == 0) {
-		qxl_io_log(qdev, "%s: 0 monitors??\n", __func__);
+	if (qdev->monitors_config->count == 0)
 		return;
-	}
+
 	for (i = 0 ; i < qdev->monitors_config->count ; ++i) {
 		struct qxl_head *head = &qdev->monitors_config->heads[i];
 
diff --git a/drivers/gpu/drm/qxl/qxl_fb.c b/drivers/gpu/drm/qxl/qxl_fb.c
index 338891401f..9a67526060 100644
--- a/drivers/gpu/drm/qxl/qxl_fb.c
+++ b/drivers/gpu/drm/qxl/qxl_fb.c
@@ -185,8 +185,6 @@ static int qxlfb_framebuffer_dirty(struct drm_framebuffer *fb,
 	/*
 	 * we are using a shadow draw buffer, at qdev->surface0_shadow
 	 */
-	qxl_io_log(qdev, "dirty x[%d, %d], y[%d, %d]\n", clips->x1, clips->x2,
-		   clips->y1, clips->y2);
 	image->dx = clips->x1;
 	image->dy = clips->y1;
 	image->width = clips->x2 - clips->x1;
diff --git a/drivers/gpu/drm/qxl/qxl_irq.c b/drivers/gpu/drm/qxl/qxl_irq.c
index 23a40106ab..3bb31add63 100644
--- a/drivers/gpu/drm/qxl/qxl_irq.c
+++ b/drivers/gpu/drm/qxl/qxl_irq.c
@@ -57,10 +57,9 @@ irqreturn_t qxl_irq_handler(int irq, void *arg)
 		 * to avoid endless loops).
 		 */
 		qdev->irq_received_error++;
-		qxl_io_log(qdev, "%s: driver is in bug mode.\n", __func__);
+		DRM_WARN("driver is in bug mode\n");
 	}
 	if (pending & QXL_INTERRUPT_CLIENT_MONITORS_CONFIG) {
-		qxl_io_log(qdev, "QXL_INTERRUPT_CLIENT_MONITORS_CONFIG\n");
 		schedule_work(&qdev->client_monitors_config_work);
 	}
 	qdev->ram_header->int_mask = QXL_INTERRUPT_MASK;
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 2/4] qxl: move qxl_send_monitors_config()
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

Needed to avoid a forward declaration in a followup patch.
Pure code move, no functional change.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_display.c | 47 +++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index dea757ef62..68db4d1965 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -258,6 +258,29 @@ static int qxl_add_common_modes(struct drm_connector *connector,
 	return i - 1;
 }
 
+static void qxl_send_monitors_config(struct qxl_device *qdev)
+{
+	int i;
+
+	BUG_ON(!qdev->ram_header->monitors_config);
+
+	if (qdev->monitors_config->count == 0)
+		return;
+
+	for (i = 0 ; i < qdev->monitors_config->count ; ++i) {
+		struct qxl_head *head = &qdev->monitors_config->heads[i];
+
+		if (head->y > 8192 || head->x > 8192 ||
+		    head->width > 8192 || head->height > 8192) {
+			DRM_ERROR("head %d wrong: %dx%d+%d+%d\n",
+				  i, head->width, head->height,
+				  head->x, head->y);
+			return;
+		}
+	}
+	qxl_io_monitors_config(qdev);
+}
+
 static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 				  struct drm_crtc_state *old_crtc_state)
 {
@@ -378,30 +401,6 @@ static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
 	return true;
 }
 
-static void
-qxl_send_monitors_config(struct qxl_device *qdev)
-{
-	int i;
-
-	BUG_ON(!qdev->ram_header->monitors_config);
-
-	if (qdev->monitors_config->count == 0)
-		return;
-
-	for (i = 0 ; i < qdev->monitors_config->count ; ++i) {
-		struct qxl_head *head = &qdev->monitors_config->heads[i];
-
-		if (head->y > 8192 || head->x > 8192 ||
-		    head->width > 8192 || head->height > 8192) {
-			DRM_ERROR("head %d wrong: %dx%d+%d+%d\n",
-				  i, head->width, head->height,
-				  head->x, head->y);
-			return;
-		}
-	}
-	qxl_io_monitors_config(qdev);
-}
-
 static void qxl_monitors_config_set(struct qxl_device *qdev,
 				    int index,
 				    unsigned x, unsigned y,
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 3/4] qxl: hook monitors_config updates into crtc, not encoder.
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

The encoder callbacks are only called in case the video mode changes.
So any layout changes without mode changes will go unnoticed.

Add qxl_crtc_update_monitors_config(), based on the old
qxl_write_monitors_config_for_encoder() function.  Hook it into the
enable, disable and flush atomic crtc callbacks.  Remove monitors_config
updates from all other places.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1544322
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_cmd.c     |   2 +
 drivers/gpu/drm/qxl/qxl_display.c | 156 ++++++++++++++++----------------------
 2 files changed, 66 insertions(+), 92 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_cmd.c b/drivers/gpu/drm/qxl/qxl_cmd.c
index 850f8d7d37..95db20f214 100644
--- a/drivers/gpu/drm/qxl/qxl_cmd.c
+++ b/drivers/gpu/drm/qxl/qxl_cmd.c
@@ -371,6 +371,7 @@ void qxl_io_flush_surfaces(struct qxl_device *qdev)
 void qxl_io_destroy_primary(struct qxl_device *qdev)
 {
 	wait_for_io_cmd(qdev, 0, QXL_IO_DESTROY_PRIMARY_ASYNC);
+	qdev->primary_created = false;
 }
 
 void qxl_io_create_primary(struct qxl_device *qdev,
@@ -396,6 +397,7 @@ void qxl_io_create_primary(struct qxl_device *qdev,
 	create->type = QXL_SURF_TYPE_PRIMARY;
 
 	wait_for_io_cmd(qdev, 0, QXL_IO_CREATE_PRIMARY_ASYNC);
+	qdev->primary_created = true;
 }
 
 void qxl_io_memslot_add(struct qxl_device *qdev, uint8_t id)
diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index 68db4d1965..06ee00b486 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -281,6 +281,66 @@ static void qxl_send_monitors_config(struct qxl_device *qdev)
 	qxl_io_monitors_config(qdev);
 }
 
+static void qxl_crtc_update_monitors_config(struct drm_crtc *crtc,
+					    const char *reason)
+{
+	struct drm_device *dev = crtc->dev;
+	struct qxl_device *qdev = dev->dev_private;
+	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
+	struct qxl_head head;
+	int oldcount, i = qcrtc->index;
+
+	if (!qdev->primary_created) {
+		DRM_DEBUG_KMS("no primary surface, skip (%s)\n", reason);
+		return;
+	}
+
+	if (!qdev->monitors_config ||
+	    qdev->monitors_config->max_allowed <= i)
+		return;
+
+	head.id = i;
+	head.flags = 0;
+	oldcount = qdev->monitors_config->count;
+	if (crtc->state->active) {
+		struct drm_display_mode *mode = &crtc->mode;
+		head.width = mode->hdisplay;
+		head.height = mode->vdisplay;
+		head.x = crtc->x;
+		head.y = crtc->y;
+		if (qdev->monitors_config->count < i + 1)
+			qdev->monitors_config->count = i + 1;
+	} else if (i > 0) {
+		head.width = 0;
+		head.height = 0;
+		head.x = 0;
+		head.y = 0;
+		if (qdev->monitors_config->count == i + 1)
+			qdev->monitors_config->count = i;
+	} else {
+		DRM_DEBUG_KMS("inactive head 0, skip (%s)\n", reason);
+		return;
+	}
+
+	if (head.width  == qdev->monitors_config->heads[i].width  &&
+	    head.height == qdev->monitors_config->heads[i].height &&
+	    head.x      == qdev->monitors_config->heads[i].x      &&
+	    head.y      == qdev->monitors_config->heads[i].y      &&
+	    oldcount    == qdev->monitors_config->count)
+		return;
+
+	DRM_DEBUG_KMS("head %d, %dx%d, at +%d+%d, %s (%s)\n",
+		      i, head.width, head.height, head.x, head.y,
+		      crtc->state->active ? "on" : "off", reason);
+	if (oldcount != qdev->monitors_config->count)
+		DRM_DEBUG_KMS("active heads %d -> %d (%d total)\n",
+			      oldcount, qdev->monitors_config->count,
+			      qdev->monitors_config->max_allowed);
+
+	qdev->monitors_config->heads[i] = head;
+	qxl_send_monitors_config(qdev);
+}
+
 static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 				  struct drm_crtc_state *old_crtc_state)
 {
@@ -296,6 +356,8 @@ static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 		drm_crtc_send_vblank_event(crtc, event);
 		spin_unlock_irqrestore(&dev->event_lock, flags);
 	}
+
+	qxl_crtc_update_monitors_config(crtc, "flush");
 }
 
 static void qxl_crtc_destroy(struct drm_crtc *crtc)
@@ -401,55 +463,20 @@ static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
 	return true;
 }
 
-static void qxl_monitors_config_set(struct qxl_device *qdev,
-				    int index,
-				    unsigned x, unsigned y,
-				    unsigned width, unsigned height,
-				    unsigned surf_id)
-{
-	DRM_DEBUG_KMS("%d:%dx%d+%d+%d\n", index, width, height, x, y);
-	qdev->monitors_config->heads[index].x = x;
-	qdev->monitors_config->heads[index].y = y;
-	qdev->monitors_config->heads[index].width = width;
-	qdev->monitors_config->heads[index].height = height;
-	qdev->monitors_config->heads[index].surface_id = surf_id;
-
-}
-
-static void qxl_mode_set_nofb(struct drm_crtc *crtc)
-{
-	struct qxl_device *qdev = crtc->dev->dev_private;
-	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
-	struct drm_display_mode *mode = &crtc->mode;
-
-	DRM_DEBUG("Mode set (%d,%d)\n",
-		  mode->hdisplay, mode->vdisplay);
-
-	qxl_monitors_config_set(qdev, qcrtc->index, 0, 0,
-				mode->hdisplay,	mode->vdisplay, 0);
-
-}
-
 static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
 				   struct drm_crtc_state *old_state)
 {
-	DRM_DEBUG("\n");
+	qxl_crtc_update_monitors_config(crtc, "enable");
 }
 
 static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
 				    struct drm_crtc_state *old_state)
 {
-	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
-	struct qxl_device *qdev = crtc->dev->dev_private;
-
-	qxl_monitors_config_set(qdev, qcrtc->index, 0, 0, 0, 0, 0);
-
-	qxl_send_monitors_config(qdev);
+	qxl_crtc_update_monitors_config(crtc, "disable");
 }
 
 static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
 	.mode_fixup = qxl_crtc_mode_fixup,
-	.mode_set_nofb = qxl_mode_set_nofb,
 	.atomic_flush = qxl_crtc_atomic_flush,
 	.atomic_enable = qxl_crtc_atomic_enable,
 	.atomic_disable = qxl_crtc_atomic_disable,
@@ -939,61 +966,8 @@ static void qxl_enc_prepare(struct drm_encoder *encoder)
 	DRM_DEBUG("\n");
 }
 
-static void qxl_write_monitors_config_for_encoder(struct qxl_device *qdev,
-		struct drm_encoder *encoder)
-{
-	int i;
-	struct qxl_output *output = drm_encoder_to_qxl_output(encoder);
-	struct qxl_head *head;
-	struct drm_display_mode *mode;
-
-	BUG_ON(!encoder);
-	/* TODO: ugly, do better */
-	i = output->index;
-	if (!qdev->monitors_config ||
-	    qdev->monitors_config->max_allowed <= i) {
-		DRM_ERROR(
-		"head number too large or missing monitors config: %p, %d",
-		qdev->monitors_config,
-		qdev->monitors_config ?
-			qdev->monitors_config->max_allowed : -1);
-		return;
-	}
-	if (!encoder->crtc) {
-		DRM_ERROR("missing crtc on encoder %p\n", encoder);
-		return;
-	}
-	if (i != 0)
-		DRM_DEBUG("missing for multiple monitors: no head holes\n");
-	head = &qdev->monitors_config->heads[i];
-	head->id = i;
-	if (encoder->crtc->enabled) {
-		mode = &encoder->crtc->mode;
-		head->width = mode->hdisplay;
-		head->height = mode->vdisplay;
-		head->x = encoder->crtc->x;
-		head->y = encoder->crtc->y;
-		if (qdev->monitors_config->count < i + 1)
-			qdev->monitors_config->count = i + 1;
-	} else {
-		head->width = 0;
-		head->height = 0;
-		head->x = 0;
-		head->y = 0;
-	}
-	DRM_DEBUG_KMS("setting head %d to +%d+%d %dx%d out of %d\n",
-		      i, head->x, head->y, head->width, head->height, qdev->monitors_config->count);
-	head->flags = 0;
-	/* TODO - somewhere else to call this for multiple monitors
-	 * (config_commit?) */
-	qxl_send_monitors_config(qdev);
-}
-
 static void qxl_enc_commit(struct drm_encoder *encoder)
 {
-	struct qxl_device *qdev = encoder->dev->dev_private;
-
-	qxl_write_monitors_config_for_encoder(qdev, encoder);
 	DRM_DEBUG("\n");
 }
 
@@ -1080,8 +1054,6 @@ static enum drm_connector_status qxl_conn_detect(
 		     qxl_head_enabled(&qdev->client_monitors_config->heads[output->index]);
 
 	DRM_DEBUG("#%d connected: %d\n", output->index, connected);
-	if (!connected)
-		qxl_monitors_config_set(qdev, output->index, 0, 0, 0, 0, 0);
 
 	return connected ? connector_status_connected
 			 : connector_status_disconnected;
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 4/4] qxl: drop dummy functions
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

These days drm core checks function pointers everywhere before calling
them.  So we can drop a bunch of dummy functions now.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_display.c | 50 ---------------------------------------
 1 file changed, 50 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index 06ee00b486..af6e52af5f 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -456,13 +456,6 @@ qxl_framebuffer_init(struct drm_device *dev,
 	return 0;
 }
 
-static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
-				  const struct drm_display_mode *mode,
-				  struct drm_display_mode *adjusted_mode)
-{
-	return true;
-}
-
 static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
 				   struct drm_crtc_state *old_state)
 {
@@ -476,7 +469,6 @@ static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
 }
 
 static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
-	.mode_fixup = qxl_crtc_mode_fixup,
 	.atomic_flush = qxl_crtc_atomic_flush,
 	.atomic_enable = qxl_crtc_atomic_enable,
 	.atomic_disable = qxl_crtc_atomic_disable,
@@ -620,12 +612,6 @@ static void qxl_primary_atomic_disable(struct drm_plane *plane,
 	}
 }
 
-static int qxl_plane_atomic_check(struct drm_plane *plane,
-				  struct drm_plane_state *state)
-{
-	return 0;
-}
-
 static void qxl_cursor_atomic_update(struct drm_plane *plane,
 				     struct drm_plane_state *old_state)
 {
@@ -831,7 +817,6 @@ static const uint32_t qxl_cursor_plane_formats[] = {
 };
 
 static const struct drm_plane_helper_funcs qxl_cursor_helper_funcs = {
-	.atomic_check = qxl_plane_atomic_check,
 	.atomic_update = qxl_cursor_atomic_update,
 	.atomic_disable = qxl_cursor_atomic_disable,
 	.prepare_fb = qxl_plane_prepare_fb,
@@ -956,28 +941,6 @@ static int qdev_crtc_init(struct drm_device *dev, int crtc_id)
 	return r;
 }
 
-static void qxl_enc_dpms(struct drm_encoder *encoder, int mode)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_prepare(struct drm_encoder *encoder)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_commit(struct drm_encoder *encoder)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_mode_set(struct drm_encoder *encoder,
-				struct drm_display_mode *mode,
-				struct drm_display_mode *adjusted_mode)
-{
-	DRM_DEBUG("\n");
-}
-
 static int qxl_conn_get_modes(struct drm_connector *connector)
 {
 	unsigned pwidth = 1024;
@@ -1023,10 +986,6 @@ static struct drm_encoder *qxl_best_encoder(struct drm_connector *connector)
 
 
 static const struct drm_encoder_helper_funcs qxl_enc_helper_funcs = {
-	.dpms = qxl_enc_dpms,
-	.prepare = qxl_enc_prepare,
-	.mode_set = qxl_enc_mode_set,
-	.commit = qxl_enc_commit,
 };
 
 static const struct drm_connector_helper_funcs qxl_connector_helper_funcs = {
@@ -1059,14 +1018,6 @@ static enum drm_connector_status qxl_conn_detect(
 			 : connector_status_disconnected;
 }
 
-static int qxl_conn_set_property(struct drm_connector *connector,
-				   struct drm_property *property,
-				   uint64_t value)
-{
-	DRM_DEBUG("\n");
-	return 0;
-}
-
 static void qxl_conn_destroy(struct drm_connector *connector)
 {
 	struct qxl_output *qxl_output =
@@ -1081,7 +1032,6 @@ static const struct drm_connector_funcs qxl_connector_funcs = {
 	.dpms = drm_helper_connector_dpms,
 	.detect = qxl_conn_detect,
 	.fill_modes = drm_helper_probe_single_connector_modes,
-	.set_property = qxl_conn_set_property,
 	.destroy = qxl_conn_destroy,
 	.reset = drm_atomic_helper_connector_reset,
 	.atomic_duplicate_state = drm_atomic_helper_connector_duplicate_state,
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH] drm/virtio: fix vq wait_event condition
From: Gerd Hoffmann @ 2018-04-20  7:22 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, open list, amagloire, stable,
	open list:VIRTIO GPU DRIVER
In-Reply-To: <20180403095904.11152-1-kraxel@redhat.com>

On Tue, Apr 03, 2018 at 11:59:04AM +0200, Gerd Hoffmann wrote:
> Wait until we have enough space in the virt queue to actually queue up
> our request.  Avoids the guest spinning in case we have a non-zero
> amount of free entries but not enough for the request.

Ping airlied, can you please either pick it up or review so I can commit
myself?

thanks,
  Gerd

> Cc: stable@vger.kernel.org
> Reported-by: Alain Magloire <amagloire@blackberry.com>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  drivers/gpu/drm/virtio/virtgpu_vq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
> index 48e4f1df6e..020070d483 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> @@ -293,7 +293,7 @@ static int virtio_gpu_queue_ctrl_buffer_locked(struct virtio_gpu_device *vgdev,
>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, incnt, vbuf, GFP_ATOMIC);
>  	if (ret == -ENOSPC) {
>  		spin_unlock(&vgdev->ctrlq.qlock);
> -		wait_event(vgdev->ctrlq.ack_queue, vq->num_free);
> +		wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= outcnt + incnt);
>  		spin_lock(&vgdev->ctrlq.qlock);
>  		goto retry;
>  	} else {
> @@ -368,7 +368,7 @@ static int virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
>  	if (ret == -ENOSPC) {
>  		spin_unlock(&vgdev->cursorq.qlock);
> -		wait_event(vgdev->cursorq.ack_queue, vq->num_free);
> +		wait_event(vgdev->cursorq.ack_queue, vq->num_free >= outcnt);
>  		spin_lock(&vgdev->cursorq.qlock);
>  		goto retry;
>  	} else {
> -- 
> 2.9.3
> 

^ permalink raw reply

* Re: [PATCH 1/2] qxl: fix qxl_release_{map,unmap}
From: Daniel Vetter @ 2018-04-20  7:48 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, Dave Airlie, open list, dri-devel,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180418054257.15388-2-kraxel@redhat.com>

On Wed, Apr 18, 2018 at 07:42:56AM +0200, Gerd Hoffmann wrote:
> s/PAGE_SIZE/PAGE_MASK/
> 
> Luckily release_offset is never larger than PAGE_SIZE, so the bug has no
> bad side effects and managed to stay unnoticed for years that way ...
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Sweeet. Since the buggy code uses the same expression for page frame and
offset I don't think there's a security bug. You might still want to cc:
stable (since without you defacto can't ever use this feature).

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/qxl/qxl_ioctl.c   | 4 ++--
>  drivers/gpu/drm/qxl/qxl_release.c | 6 +++---
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/drivers/gpu/drm/qxl/qxl_ioctl.c
> index e238a1a2ec..6cc9f3367f 100644
> --- a/drivers/gpu/drm/qxl/qxl_ioctl.c
> +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
> @@ -182,9 +182,9 @@ static int qxl_process_single_command(struct qxl_device *qdev,
>  		goto out_free_reloc;
>  
>  	/* TODO copy slow path code from i915 */
> -	fb_cmd = qxl_bo_kmap_atomic_page(qdev, cmd_bo, (release->release_offset & PAGE_SIZE));
> +	fb_cmd = qxl_bo_kmap_atomic_page(qdev, cmd_bo, (release->release_offset & PAGE_MASK));
>  	unwritten = __copy_from_user_inatomic_nocache
> -		(fb_cmd + sizeof(union qxl_release_info) + (release->release_offset & ~PAGE_SIZE),
> +		(fb_cmd + sizeof(union qxl_release_info) + (release->release_offset & ~PAGE_MASK),
>  		 u64_to_user_ptr(cmd->command), cmd->command_size);
>  
>  	{
> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> index 5d84a66fed..a0b4244d28 100644
> --- a/drivers/gpu/drm/qxl/qxl_release.c
> +++ b/drivers/gpu/drm/qxl/qxl_release.c
> @@ -411,10 +411,10 @@ union qxl_release_info *qxl_release_map(struct qxl_device *qdev,
>  	struct qxl_bo_list *entry = list_first_entry(&release->bos, struct qxl_bo_list, tv.head);
>  	struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);
>  
> -	ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_SIZE);
> +	ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_MASK);
>  	if (!ptr)
>  		return NULL;
> -	info = ptr + (release->release_offset & ~PAGE_SIZE);
> +	info = ptr + (release->release_offset & ~PAGE_MASK);
>  	return info;
>  }
>  
> @@ -426,7 +426,7 @@ void qxl_release_unmap(struct qxl_device *qdev,
>  	struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);
>  	void *ptr;
>  
> -	ptr = ((void *)info) - (release->release_offset & ~PAGE_SIZE);
> +	ptr = ((void *)info) - (release->release_offset & ~PAGE_MASK);
>  	qxl_bo_kunmap_atomic_page(qdev, bo, ptr);
>  }
>  
> -- 
> 2.9.3
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH v2 4/4] qxl: drop dummy functions
From: Daniel Vetter @ 2018-04-20  8:14 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, David Airlie, open list, dri-devel,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-5-kraxel@redhat.com>

On Fri, Apr 20, 2018 at 09:19:04AM +0200, Gerd Hoffmann wrote:
> These days drm core checks function pointers everywhere before calling
> them.  So we can drop a bunch of dummy functions now.
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  drivers/gpu/drm/qxl/qxl_display.c | 50 ---------------------------------------
>  1 file changed, 50 deletions(-)
> 
> diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
> index 06ee00b486..af6e52af5f 100644
> --- a/drivers/gpu/drm/qxl/qxl_display.c
> +++ b/drivers/gpu/drm/qxl/qxl_display.c
> @@ -456,13 +456,6 @@ qxl_framebuffer_init(struct drm_device *dev,
>  	return 0;
>  }
>  
> -static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
> -				  const struct drm_display_mode *mode,
> -				  struct drm_display_mode *adjusted_mode)
> -{
> -	return true;
> -}
> -
>  static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
>  				   struct drm_crtc_state *old_state)
>  {
> @@ -476,7 +469,6 @@ static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
>  }
>  
>  static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
> -	.mode_fixup = qxl_crtc_mode_fixup,
>  	.atomic_flush = qxl_crtc_atomic_flush,
>  	.atomic_enable = qxl_crtc_atomic_enable,
>  	.atomic_disable = qxl_crtc_atomic_disable,
> @@ -620,12 +612,6 @@ static void qxl_primary_atomic_disable(struct drm_plane *plane,
>  	}
>  }
>  
> -static int qxl_plane_atomic_check(struct drm_plane *plane,
> -				  struct drm_plane_state *state)
> -{
> -	return 0;
> -}
> -
>  static void qxl_cursor_atomic_update(struct drm_plane *plane,
>  				     struct drm_plane_state *old_state)
>  {
> @@ -831,7 +817,6 @@ static const uint32_t qxl_cursor_plane_formats[] = {
>  };
>  
>  static const struct drm_plane_helper_funcs qxl_cursor_helper_funcs = {
> -	.atomic_check = qxl_plane_atomic_check,
>  	.atomic_update = qxl_cursor_atomic_update,
>  	.atomic_disable = qxl_cursor_atomic_disable,
>  	.prepare_fb = qxl_plane_prepare_fb,
> @@ -956,28 +941,6 @@ static int qdev_crtc_init(struct drm_device *dev, int crtc_id)
>  	return r;
>  }
>  
> -static void qxl_enc_dpms(struct drm_encoder *encoder, int mode)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_prepare(struct drm_encoder *encoder)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_commit(struct drm_encoder *encoder)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_mode_set(struct drm_encoder *encoder,
> -				struct drm_display_mode *mode,
> -				struct drm_display_mode *adjusted_mode)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
>  static int qxl_conn_get_modes(struct drm_connector *connector)
>  {
>  	unsigned pwidth = 1024;
> @@ -1023,10 +986,6 @@ static struct drm_encoder *qxl_best_encoder(struct drm_connector *connector)
>  
>  
>  static const struct drm_encoder_helper_funcs qxl_enc_helper_funcs = {

Hm, I thought even the vtable itself is optional? We do have tons of if
(!funcs) checks all over the place at least.


> -	.dpms = qxl_enc_dpms,
> -	.prepare = qxl_enc_prepare,
> -	.mode_set = qxl_enc_mode_set,
> -	.commit = qxl_enc_commit,
>  };
>  
>  static const struct drm_connector_helper_funcs qxl_connector_helper_funcs = {
> @@ -1059,14 +1018,6 @@ static enum drm_connector_status qxl_conn_detect(
>  			 : connector_status_disconnected;
>  }
>  
> -static int qxl_conn_set_property(struct drm_connector *connector,
> -				   struct drm_property *property,
> -				   uint64_t value)
> -{
> -	DRM_DEBUG("\n");
> -	return 0;
> -}
> -
>  static void qxl_conn_destroy(struct drm_connector *connector)
>  {
>  	struct qxl_output *qxl_output =
> @@ -1081,7 +1032,6 @@ static const struct drm_connector_funcs qxl_connector_funcs = {
>  	.dpms = drm_helper_connector_dpms,
>  	.detect = qxl_conn_detect,
>  	.fill_modes = drm_helper_probe_single_connector_modes,
> -	.set_property = qxl_conn_set_property,

This is a legacy callback anyway. I kinda wonder whether we should have a
bunch of WARN_ON checks for this, so that atomic drivers don't have these
hanging around anymore.

Anyway, patch looks pretty.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>  	.destroy = qxl_conn_destroy,
>  	.reset = drm_atomic_helper_connector_reset,
>  	.atomic_duplicate_state = drm_atomic_helper_connector_duplicate_state,
> -- 
> 2.9.3
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox