Linux virtualization list
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20 16:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev,
	David Miller
In-Reply-To: <CAKgT0UeQTx7zJPK3K3eM9xxHfVyHXwJ-G_b8eqGn0bWAyt9aAg@mail.gmail.com>

On Fri, Apr 20, 2018 at 08:56:57AM -0700, Alexander Duyck wrote:
> On Fri, Apr 20, 2018 at 8:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
> >> > > + finfo = netdev_priv(failover_dev);
> >> > > +
> >> > > + primary_dev = rtnl_dereference(finfo->primary_dev);
> >> > > + standby_dev = rtnl_dereference(finfo->standby_dev);
> >> > > +
> >> > > + if (slave_dev != primary_dev && slave_dev != standby_dev)
> >> > > +         goto done;
> >> > > +
> >> > > + if ((primary_dev && failover_xmit_ready(primary_dev)) ||
> >> > > +     (standby_dev && failover_xmit_ready(standby_dev))) {
> >> > > +         netif_carrier_on(failover_dev);
> >> > > +         netif_tx_wake_all_queues(failover_dev);
> >> > > + } else {
> >> > > +         netif_carrier_off(failover_dev);
> >> > > +         netif_tx_stop_all_queues(failover_dev);
> >> > And I think it's a good idea to get stats from device here too.
> >>
> >> Not sure why we need to get stats from lower devs here?
> >
> > link down is often indication of a hardware problem.
> > lower dev might stop responding down the road.
> >
> >> > > +static const struct net_device_ops failover_dev_ops = {
> >> > > + .ndo_open               = failover_open,
> >> > > + .ndo_stop               = failover_close,
> >> > > + .ndo_start_xmit         = failover_start_xmit,
> >> > > + .ndo_select_queue       = failover_select_queue,
> >> > > + .ndo_get_stats64        = failover_get_stats,
> >> > > + .ndo_change_mtu         = failover_change_mtu,
> >> > > + .ndo_set_rx_mode        = failover_set_rx_mode,
> >> > > + .ndo_validate_addr      = eth_validate_addr,
> >> > > + .ndo_features_check     = passthru_features_check,
> >> > xdp support?
> >>
> >> I think it should be possible to add it be calling the lower dev ndo_xdp routines
> >> with proper checks. can we add this later?
> >
> > I'd be concerned that if you don't xdp userspace will keep poking
> > at lower devs. Then it will stop working if you add this later.
> 
> The failover device is better off not providing in-driver XDP since
> there are already skbs allocated by the time that we see the packet
> here anyway. As such generic XDP is the preferred way to handle this
> since it will work regardless of what lower devices are present.
>
> The only advantage of having XDP down at the virtio or VF level would
> be that it performs better, but at the cost of complexity since we
> would need to rebind the eBPF program any time a device is hotplugged
> out and then back in. For now the current approach is in keeping with
> how bonding and other similar drivers are currently handling this.
> 
> Thanks.
> 
> - Alex

OK fair enough.

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Jiri Pirko @ 2018-04-20 16:00 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>

Fri, Apr 20, 2018 at 05:28:02PM CEST, stephen@networkplumber.org wrote:
>On Thu, 19 Apr 2018 18:42:04 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>
>Do what you want to other devices but leave netvsc alone.
>Adding these failover ops does not reduce the code size, and really is
>no benefit.  The netvsc device driver needs to be backported to several
>other distributions and doing this makes that harder.

We should not care about the backport burden when we are trying to make
things right. And things are not right. The current netvsc approach is
just plain wrong shortcut. It should have been done in a generic way
from the very beginning. We are just trying to fix this situation.

Moreover, I believe that part of the fix is to convert netvsc to 3
netdev solution too. 2 netdev model is wrong.


>
>I will NAK patches to change to common code for netvsc especially the
>three device model.  MS worked hard with distro vendors to support transparent
>mode, ans we really can't have a new model; or do backport.
>
>Plus, DPDK is now dependent on existing model.

Sorry, but nobody here cares about dpdk or other similar oddities.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Alexander Duyck @ 2018-04-20 15:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev,
	David Miller
In-Reply-To: <20180420183021-mutt-send-email-mst@kernel.org>

On Fri, Apr 20, 2018 at 8:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
>> > > + finfo = netdev_priv(failover_dev);
>> > > +
>> > > + primary_dev = rtnl_dereference(finfo->primary_dev);
>> > > + standby_dev = rtnl_dereference(finfo->standby_dev);
>> > > +
>> > > + if (slave_dev != primary_dev && slave_dev != standby_dev)
>> > > +         goto done;
>> > > +
>> > > + if ((primary_dev && failover_xmit_ready(primary_dev)) ||
>> > > +     (standby_dev && failover_xmit_ready(standby_dev))) {
>> > > +         netif_carrier_on(failover_dev);
>> > > +         netif_tx_wake_all_queues(failover_dev);
>> > > + } else {
>> > > +         netif_carrier_off(failover_dev);
>> > > +         netif_tx_stop_all_queues(failover_dev);
>> > And I think it's a good idea to get stats from device here too.
>>
>> Not sure why we need to get stats from lower devs here?
>
> link down is often indication of a hardware problem.
> lower dev might stop responding down the road.
>
>> > > +static const struct net_device_ops failover_dev_ops = {
>> > > + .ndo_open               = failover_open,
>> > > + .ndo_stop               = failover_close,
>> > > + .ndo_start_xmit         = failover_start_xmit,
>> > > + .ndo_select_queue       = failover_select_queue,
>> > > + .ndo_get_stats64        = failover_get_stats,
>> > > + .ndo_change_mtu         = failover_change_mtu,
>> > > + .ndo_set_rx_mode        = failover_set_rx_mode,
>> > > + .ndo_validate_addr      = eth_validate_addr,
>> > > + .ndo_features_check     = passthru_features_check,
>> > xdp support?
>>
>> I think it should be possible to add it be calling the lower dev ndo_xdp routines
>> with proper checks. can we add this later?
>
> I'd be concerned that if you don't xdp userspace will keep poking
> at lower devs. Then it will stop working if you add this later.

The failover device is better off not providing in-driver XDP since
there are already skbs allocated by the time that we see the packet
here anyway. As such generic XDP is the preferred way to handle this
since it will work regardless of what lower devices are present.

The only advantage of having XDP down at the virtio or VF level would
be that it performs better, but at the cost of complexity since we
would need to rebind the eBPF program any time a device is hotplugged
out and then back in. For now the current approach is in keeping with
how bonding and other similar drivers are currently handling this.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: David Miller @ 2018-04-20 15:47 UTC (permalink / raw)
  To: mst
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, sridhar.samudrala,
	virtualization, loseweigh, netdev
In-Reply-To: <20180420183505-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 20 Apr 2018 18:43:54 +0300

> On Fri, Apr 20, 2018 at 08:28:02AM -0700, Stephen Hemminger wrote:
>> Plus, DPDK is now dependent on existing model.
> 
> DPDK does the kernel bypass thing, doesn't it? Why does the kernel care?

+1

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Samudrala, Sridhar @ 2018-04-20 15:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>



On 4/20/2018 8:28 AM, Stephen Hemminger wrote:
> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size, and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
>
> I will NAK patches to change to common code for netvsc especially the
> three device model.  MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model; or do backport.

failover_ops are specifically added to support both 2-netdev and 3-netdev models
This patch doesn't change netvsc model. It still keeps its 2-netdev model. From
netvsc, point of view it is just moving some code from netvsc to the failover module
and also i think the eventhandling and getbymac routines are more optimal.


> Plus, DPDK is now dependent on existing model.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: David Miller @ 2018-04-20 15:46 UTC (permalink / raw)
  To: stephen
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici,
	sridhar.samudrala, virtualization, loseweigh, netdev
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Fri, 20 Apr 2018 08:28:02 -0700

> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> 
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size, and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
> 
> I will NAK patches to change to common code for netvsc especially the
> three device model.  MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model; or do backport.
> 
> Plus, DPDK is now dependent on existing model.

Stephen, I understand your situation.

But the result we have now is undesirable and it happened because MS
worked with distro vendors on this rather than the upstream community
and entities with other device with similar needs.

Please next time do the latter rather than the former.

Thank you.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-20 15:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>

On Fri, Apr 20, 2018 at 08:28:02AM -0700, Stephen Hemminger wrote:
> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > failover infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> 
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size,

drivers/net/hyperv/Kconfig      |   1 +
drivers/net/hyperv/hyperv_net.h |   2 +
drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
3 files changed, 55 insertions(+), 156 deletions(-)

100 lines gone.


> and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
> 
> I will NAK patches to change to common code for netvsc

Wow.

> especially the
> three device model.

AFAIK these patches do not change netvsc to a three device model.

> MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model;

That's why Sridhar worked hard to preserve a 2 device model for netvsc.

> or do backport.
>
> Plus, DPDK is now dependent on existing model.

DPDK does the kernel bypass thing, doesn't it? Why does the kernel care?

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20 15:34 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <d1bf0771-751b-85bc-4890-8232283131a2@intel.com>

On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
> > > +	finfo = netdev_priv(failover_dev);
> > > +
> > > +	primary_dev = rtnl_dereference(finfo->primary_dev);
> > > +	standby_dev = rtnl_dereference(finfo->standby_dev);
> > > +
> > > +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> > > +		goto done;
> > > +
> > > +	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
> > > +	    (standby_dev && failover_xmit_ready(standby_dev))) {
> > > +		netif_carrier_on(failover_dev);
> > > +		netif_tx_wake_all_queues(failover_dev);
> > > +	} else {
> > > +		netif_carrier_off(failover_dev);
> > > +		netif_tx_stop_all_queues(failover_dev);
> > And I think it's a good idea to get stats from device here too.
> 
> Not sure why we need to get stats from lower devs here?

link down is often indication of a hardware problem.
lower dev might stop responding down the road.

> > > +static const struct net_device_ops failover_dev_ops = {
> > > +	.ndo_open		= failover_open,
> > > +	.ndo_stop		= failover_close,
> > > +	.ndo_start_xmit		= failover_start_xmit,
> > > +	.ndo_select_queue	= failover_select_queue,
> > > +	.ndo_get_stats64	= failover_get_stats,
> > > +	.ndo_change_mtu		= failover_change_mtu,
> > > +	.ndo_set_rx_mode	= failover_set_rx_mode,
> > > +	.ndo_validate_addr	= eth_validate_addr,
> > > +	.ndo_features_check	= passthru_features_check,
> > xdp support?
> 
> I think it should be possible to add it be calling the lower dev ndo_xdp routines
> with proper checks. can we add this later?

I'd be concerned that if you don't xdp userspace will keep poking
at lower devs. Then it will stop working if you add this later.

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-20 15:28 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <1524188524-28411-5-git-send-email-sridhar.samudrala@intel.com>

On Thu, 19 Apr 2018 18:42:04 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

Do what you want to other devices but leave netvsc alone.
Adding these failover ops does not reduce the code size, and really is
no benefit.  The netvsc device driver needs to be backported to several
other distributions and doing this makes that harder.

I will NAK patches to change to common code for netvsc especially the
three device model.  MS worked hard with distro vendors to support transparent
mode, ans we really can't have a new model; or do backport.

Plus, DPDK is now dependent on existing model.

^ permalink raw reply

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Samudrala, Sridhar @ 2018-04-20 15:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180420045657-mutt-send-email-mst@kernel.org>


On 4/19/2018 7:44 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each instance
>> of netvsc as a 'failover' instance  along with a set of ops to manage the
>> slave events.
>>       failover_register()
>>       failover_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the failover module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>        failover_create()
>>        failover_destroy()
>> These functions call failover_register()/failover_unregister() with the
>> master netdev created by the failover module.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> I like this patch. Yes something to improve (see below)
>
>> ---
>>   include/linux/netdevice.h |  16 +
>>   include/net/failover.h    |  96 ++++++
>>   net/Kconfig               |  18 +
>>   net/core/Makefile         |   1 +
>>   net/core/failover.c       | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>   5 files changed, 975 insertions(+)
>>   create mode 100644 include/net/failover.h
>>   create mode 100644 net/core/failover.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..ed535b6724e1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1401,6 +1401,8 @@ struct net_device_ops {
>>    *	entity (i.e. the master device for bridged veth)
>>    * @IFF_MACSEC: device is a MACsec device
>>    * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>> + * @IFF_FAILOVER: device is a failover master device
>> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>>    */
>>   enum netdev_priv_flags {
>>   	IFF_802_1Q_VLAN			= 1<<0,
>> @@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
>>   	IFF_PHONY_HEADROOM		= 1<<24,
>>   	IFF_MACSEC			= 1<<25,
>>   	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_FAILOVER			= 1<<27,
>> +	IFF_FAILOVER_SLAVE		= 1<<28,
>>   };
>>   
>>   #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
>>   #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>   #define IFF_MACSEC			IFF_MACSEC
>>   #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_FAILOVER			IFF_FAILOVER
>> +#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
>>   
>>   /**
>>    *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>   	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>   }
>>   
>> +static inline bool netif_is_failover(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_FAILOVER;
>> +}
>> +
>> +static inline bool netif_is_failover_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_FAILOVER_SLAVE;
>> +}
>> +
>>   /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>   static inline void netif_keep_dst(struct net_device *dev)
>>   {
>> diff --git a/include/net/failover.h b/include/net/failover.h
>> new file mode 100644
>> index 000000000000..0b8601043d90
>> --- /dev/null
>> +++ b/include/net/failover.h
>> @@ -0,0 +1,96 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_FAILOVER_H
>> +#define _NET_FAILOVER_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct failover_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_dev,
>> +				  struct net_device *failover_dev);
>> +	int (*slave_join)(struct net_device *slave_dev,
>> +			  struct net_device *failover_dev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_dev,
>> +				    struct net_device *failover_dev);
>> +	int (*slave_release)(struct net_device *slave_dev,
>> +			     struct net_device *failover_dev);
>> +	int (*slave_link_change)(struct net_device *slave_dev,
>> +				 struct net_device *failover_dev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct failover {
>> +	struct list_head list;
>> +	struct net_device __rcu *failover_dev;
>> +	struct failover_ops __rcu *ops;
>> +};
>> +
>> +/* failover state */
>> +struct failover_info {
>> +	/* primary netdev with same MAC */
>> +	struct net_device __rcu *primary_dev;
>> +
>> +	/* standby netdev */
>> +	struct net_device __rcu *standby_dev;
>> +
>> +	/* primary netdev stats */
>> +	struct rtnl_link_stats64 primary_stats;
>> +
>> +	/* standby netdev stats */
>> +	struct rtnl_link_stats64 standby_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 failover_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_FAILOVER)
>> +
>> +int failover_create(struct net_device *standby_dev,
>> +		    struct failover **pfailover);
>> +void failover_destroy(struct failover *failover);
>> +
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> +		      struct failover **pfailover);
>> +void failover_unregister(struct failover *failover);
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev);
>> +
>> +#else
>> +
>> +static inline
>> +int failover_create(struct net_device *standby_dev,
>> +		    struct failover **pfailover);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void failover_destroy(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> +		      struct pfailover **pfailover);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void failover_unregister(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_FAILOVER_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..388b99dfee10 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>>   	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>>   	  devlink is a loadable module and the driver using it is built-in.
>>   
>> +config NET_FAILOVER
>> +	tristate "Failover interface"
>> +	help
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_FAILOVER
>> +	tristate
>> +	default m if NET_FAILOVER=m
>> +	default y if NET_FAILOVER=y || NET_FAILOVER=n
>> +	help
>> +	  Drivers using the failover infrastructure should have a dependency
>> +	  on MAY_USE_FAILOVER to ensure they do not cause link errors when
>> +	  failover is a loadable module and the driver using it is built-in.
>> +
>>   endif   # if NET
>>   
>>   # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..cef17518bb7d 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>>   obj-$(CONFIG_HWBM) += hwbm.o
>>   obj-$(CONFIG_NET_DEVLINK) += devlink.o
>>   obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_FAILOVER) += failover.o
>> diff --git a/net/core/failover.c b/net/core/failover.c
>> new file mode 100644
>> index 000000000000..7bee762cb737
>> --- /dev/null
>> +++ b/net/core/failover.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>
> I think you should copy the header from bond_main.c upon which
> some of the code seems to be based.

Yes. some of the code is based on bonding/team and netvsc. i added a reference
to netvsc in the comment, will also include bonding/team driver.

>
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/failover.h>
>> +
>> +static LIST_HEAD(failover_list);
>> +static DEFINE_SPINLOCK(failover_lock);
>> +
>> +static int failover_slave_pre_register(struct net_device *slave_dev,
>> +				       struct net_device *failover_dev,
>> +				       struct failover_ops *failover_ops)
>> +{
>> +	struct failover_info *finfo;
>> +	bool standby;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_pre_register(slave_dev,
>> +							failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> +	if (standby ? rtnl_dereference(finfo->standby_dev) :
>> +			rtnl_dereference(finfo->primary_dev)) {
>> +		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_dev->name, standby ? "standby" : "primary");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as primary netdev */
> Why? Pls change this comment so it explains the motivation
> rather than just repeat what the code does.

OK.

>
>> +	if (!standby && (!slave_dev->dev.parent ||
>> +			 !dev_is_pci(slave_dev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int failover_slave_join(struct net_device *slave_dev,
>> +			       struct net_device *failover_dev,
>> +			       struct failover_ops *failover_ops)
>> +{
>> +	struct failover_info *finfo;
>> +	int err, orig_mtu;
>> +	bool standby;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_join(slave_dev, failover_dev);
>> +	}
>> +
>> +	if (netif_running(failover_dev)) {
>> +		err = dev_open(slave_dev);
>> +		if (err && (err != -EBUSY)) {
>> +			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
>> +				   slave_dev->name, err);
>> +			goto err_dev_open;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with failover dev */
>> +	orig_mtu = slave_dev->mtu;
> I suspect this was copied from bond. this variable is never
> used and I'm even surprised gcc did not warn about this.

Thanks for catching, I broke this when i moved the dev_open() and dev_set_mtu()
calls from register to join. I need to reset slave_dev mtu to orig_mtu on failure.


>
>
>> +	err = dev_set_mtu(slave_dev, failover_dev->mtu);
> How do we know slave supports this MTU? same applies to
> failover_change_mtu.

The err check below should catch it and we will reset the mtu back and
fail the join/register.

>
>
>
>
>> +	if (err) {
>> +		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_dev->name, failover_dev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> +
>> +	dev_hold(slave_dev);
>> +
>> +	if (standby) {
>> +		rcu_assign_pointer(finfo->standby_dev, slave_dev);
>> +		dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
>> +	} else {
>> +		rcu_assign_pointer(finfo->primary_dev, slave_dev);
>> +		dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
>> +		failover_dev->min_mtu = slave_dev->min_mtu;
>> +		failover_dev->max_mtu = slave_dev->max_mtu;
>> +	}
>> +
>> +	netdev_info(failover_dev, "failover slave:%s joined\n",
>> +		    slave_dev->name);
>> +
>> +	return 0;
>> +
>> +err_set_mtu:
>> +	dev_close(slave_dev);
>> +err_dev_open:
>> +	return err;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
>> +{
>> +	struct net_device *failover_dev;
>> +	struct failover *failover;
>> +
>> +	spin_lock(&failover_lock);
>> +	list_for_each_entry(failover, &failover_list, list) {
>> +		failover_dev = rtnl_dereference(failover->failover_dev);
>> +		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
>> +			*ops = rtnl_dereference(failover->ops);
>> +			spin_unlock(&failover_lock);
>> +			return failover_dev;
>> +		}
>> +	}
>> +	spin_unlock(&failover_lock);
>> +	return NULL;
>> +}
>> +
>> +static int failover_slave_register(struct net_device *slave_dev)
>> +{
>> +	struct failover_ops *failover_ops;
>> +	struct net_device *failover_dev;
>> +	int ret;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	ret = failover_slave_pre_register(slave_dev, failover_dev,
>> +					  failover_ops);
>> +	if (ret)
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_dev, failover_ops ?
>> +					 failover_ops->handle_frame :
>> +					 failover_handle_frame, failover_dev);
>> +	if (ret) {
>> +		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> +	if (ret) {
>> +		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
>> +			   failover_dev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
>> +
>> +	ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
>> +	if (ret)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
>> +
>> +	netdev_info(failover_dev, "failover slave:%s registered\n",
>> +		    slave_dev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
>> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_dev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int failover_slave_pre_unregister(struct net_device *slave_dev,
>> +					 struct net_device *failover_dev,
>> +					 struct failover_ops *failover_ops)
>> +{
>> +	struct net_device *standby_dev, *primary_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_pre_unregister(slave_dev,
>> +							  failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int failover_slave_release(struct net_device *slave_dev,
>> +				  struct net_device *failover_dev,
>> +				  struct failover_ops *failover_ops)
>> +{
>> +	struct net_device *standby_dev, *primary_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_release)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_release(slave_dev, failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev == standby_dev) {
>> +		RCU_INIT_POINTER(finfo->standby_dev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(finfo->primary_dev, NULL);
>> +		if (standby_dev) {
>> +			failover_dev->min_mtu = standby_dev->min_mtu;
>> +			failover_dev->max_mtu = standby_dev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_dev);
>> +
>> +	netdev_info(failover_dev, "failover slave:%s released\n",
>> +		    slave_dev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> +	struct failover_ops *failover_ops;
>> +	struct net_device *failover_dev;
>> +	int ret;
>> +
>> +	if (!netif_is_failover_slave(slave_dev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	ret = failover_slave_pre_unregister(slave_dev, failover_dev,
>> +					    failover_ops);
>> +	if (ret)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_dev);
>> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
>> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +
>> +	failover_slave_release(slave_dev, failover_dev, failover_ops);
>
> Don't you need to get stats from it? This device is going away ...

Yes. we need to update the failover_stats before the slave goes away.

>
>> +
>> +	netdev_info(failover_dev, "failover slave:%s unregistered\n",
>> +		    slave_dev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_slave_unregister);
>> +
>> +static bool failover_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int failover_slave_link_change(struct net_device *slave_dev)
>> +{
>> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>> +	struct failover_ops *failover_ops;
>> +	struct failover_info *finfo;
>> +
>> +	if (!netif_is_failover_slave(slave_dev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return failover_ops->slave_link_change(slave_dev, failover_dev);
>> +	}
>> +
>> +	if (!netif_running(failover_dev))
>> +		return 0;
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
>> +		goto done;
>> +
>> +	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
>> +	    (standby_dev && failover_xmit_ready(standby_dev))) {
>> +		netif_carrier_on(failover_dev);
>> +		netif_tx_wake_all_queues(failover_dev);
>> +	} else {
>> +		netif_carrier_off(failover_dev);
>> +		netif_tx_stop_all_queues(failover_dev);
> And I think it's a good idea to get stats from device here too.

Not sure why we need to get stats from lower devs here?


>
>
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool failover_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_failover(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
> ... for now. It would be possible easy to make this generic -
> just copy things like type and addr_len from slave.

ok. failover_create can copy these values from standby_dev and
we can validate that they match when primary_dev registers.

>
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +failover_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!failover_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return failover_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return failover_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return failover_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block failover_notifier = {
>> +	.notifier_call = failover_event,
>> +};
>> +
>> +static int failover_open(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev, *standby_dev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		err = dev_open(primary_dev);
>> +		if (err)
>> +			goto err_primary_open;
>> +	}
>> +
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (standby_dev) {
>> +		err = dev_open(standby_dev);
>> +		if (err)
>> +			goto err_standby_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_standby_open:
>> +	dev_close(primary_dev);
>> +err_primary_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +
>> +static int failover_close(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (slave_dev)
>> +		dev_close(slave_dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (slave_dev)
>> +		dev_close(slave_dev);
>> +
>> +	return 0;
>> +}
>> +
>> +static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
>> +				      struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
>> +				       struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *xmit_dev;
>> +
>> +	/* Try xmit via primary netdev followed by standby netdev */
>> +	xmit_dev = rcu_dereference_bh(finfo->primary_dev);
>> +	if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(finfo->standby_dev);
>> +		if (!xmit_dev || !failover_xmit_ready(xmit_dev))
>> +			return failover_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
> Is this going through qdisc twice? Won't this hurt performance
> measureably?

The failover dev has no queue (IFF_NO_QUEUE) , so doesn't go through qdisc twice.

>
>> +
>> +static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +				 void *accel_priv,
>> +				 select_queue_fallback_t fallback)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev;
>> +	u16 txq;
>> +
>> +	rcu_read_lock();
>> +	primary_dev = rcu_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		const struct net_device_ops *ops = primary_dev->netdev_ops;
>> +
>> +		if (ops->ndo_select_queue)
>> +			txq = ops->ndo_select_queue(primary_dev, skb,
>> +						    accel_priv, fallback);
>> +		else
>> +			txq = fallback(primary_dev, skb);
>> +
>> +		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +		return txq;
>> +	}
>> +
>> +	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32finfot values only.
>> + */
>> +static void failover_fold_stats(struct rtnl_link_stats64 *_res,
>> +				const struct rtnl_link_stats64 *_new,
>> +				const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +static void failover_get_stats(struct net_device *dev,
>> +			       struct rtnl_link_stats64 *stats)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_dev;
>> +
>> +	spin_lock(&finfo->stats_lock);
>> +	memcpy(stats, &finfo->failover_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_dev = rcu_dereference(finfo->primary_dev);
>> +	if (slave_dev) {
>> +		new = dev_get_stats(slave_dev, &temp);
>> +		failover_fold_stats(stats, new, &finfo->primary_stats);
>> +		memcpy(&finfo->primary_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_dev = rcu_dereference(finfo->standby_dev);
>> +	if (slave_dev) {
>> +		new = dev_get_stats(slave_dev, &temp);
>> +		failover_fold_stats(stats, new, &finfo->standby_stats);
>> +		memcpy(&finfo->standby_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&finfo->failover_stats, stats, sizeof(*stats));
>> +	spin_unlock(&finfo->stats_lock);
>> +}
>> +
>> +static int failover_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev, *standby_dev;
>> +	int ret = 0;
>> +
>> +	primary_dev = rcu_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		ret = dev_set_mtu(primary_dev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	standby_dev = rcu_dereference(finfo->standby_dev);
>> +	if (standby_dev) {
>> +		ret = dev_set_mtu(standby_dev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(primary_dev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +
>> +static void failover_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_dev = rcu_dereference(finfo->primary_dev);
>> +	if (slave_dev) {
>> +		dev_uc_sync_multiple(slave_dev, dev);
>> +		dev_mc_sync_multiple(slave_dev, dev);
>> +	}
>> +
>> +	slave_dev = rcu_dereference(finfo->standby_dev);
>> +	if (slave_dev) {
>> +		dev_uc_sync_multiple(slave_dev, dev);
>> +		dev_mc_sync_multiple(slave_dev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +
>> +static const struct net_device_ops failover_dev_ops = {
>> +	.ndo_open		= failover_open,
>> +	.ndo_stop		= failover_close,
>> +	.ndo_start_xmit		= failover_start_xmit,
>> +	.ndo_select_queue	= failover_select_queue,
>> +	.ndo_get_stats64	= failover_get_stats,
>> +	.ndo_change_mtu		= failover_change_mtu,
>> +	.ndo_set_rx_mode	= failover_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
> xdp support?

I think it should be possible to add it be calling the lower dev ndo_xdp routines
with proper checks. can we add this later?

>
>> +};
>> +
>> +#define FAILOVER_NAME "failover"
>> +#define FAILOVER_VERSION "0.1"
>> +
>> +static void failover_ethtool_get_drvinfo(struct net_device *dev,
>> +					 struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int failover_ethtool_get_link_ksettings(struct net_device *dev,
>> +					struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> +		slave_dev = rtnl_dereference(finfo->standby_dev);
>> +		if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_dev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops failover_ethtool_ops = {
>> +	.get_drvinfo            = failover_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = failover_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void failover_register_existing_slave(struct net_device *failover_dev)
>> +{
>> +	struct net *net = dev_net(failover_dev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == failover_dev)
>> +			continue;
>> +		if (!failover_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
>> +			failover_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int failover_register(struct net_device *dev, struct failover_ops *ops,
>> +		      struct failover **pfailover)
>> +{
>> +	struct failover *failover;
>> +
>> +	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
>> +	if (!failover)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(failover->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_FAILOVER;
>> +	rcu_assign_pointer(failover->failover_dev, dev);
>> +
>> +	spin_lock(&failover_lock);
>> +	list_add_tail(&failover->list, &failover_list);
>> +	spin_unlock(&failover_lock);
>> +
>> +	failover_register_existing_slave(dev);
>> +
>> +	*pfailover = failover;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_register);
>> +
>> +void failover_unregister(struct failover *failover)
>> +{
>> +	struct net_device *failover_dev;
>> +
>> +	failover_dev = rcu_dereference(failover->failover_dev);
>> +
>> +	failover_dev->priv_flags &= ~IFF_FAILOVER;
>> +	dev_put(failover_dev);
>> +
>> +	spin_lock(&failover_lock);
>> +	list_del(&failover->list);
>> +	spin_unlock(&failover_lock);
>> +
>> +	kfree(failover);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_unregister);
>> +
>> +int failover_create(struct net_device *standby_dev, struct failover **pfailover)
>> +{
>> +	struct device *dev = standby_dev->dev.parent;
>> +	struct net_device *failover_dev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
>> +	if (!failover_dev) {
>> +		dev_err(dev, "Unable to allocate failover_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(failover_dev, dev_net(standby_dev));
>> +	SET_NETDEV_DEV(failover_dev, dev);
>> +
>> +	failover_dev->netdev_ops = &failover_dev_ops;
>> +	failover_dev->ethtool_ops = &failover_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire failover netdev's netif_tx_lock when transmitting */
>> +	failover_dev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow failover devices to change network namespaces. */
>> +	failover_dev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				    NETIF_F_HIGHDMA | NETIF_F_LRO;
> OK but then you must make sure your primary and standby both
> support these features.

I guess i need to add something like bond_compute_features() to handle this.

>
>> +
>> +	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	failover_dev->features |= failover_dev->hw_features;
>> +
>> +	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
>> +	       failover_dev->addr_len);
>> +
>> +	failover_dev->min_mtu = standby_dev->min_mtu;
>> +	failover_dev->max_mtu = standby_dev->max_mtu;
> OK MTU is copied, fine. But is this always enough?
>
> How about e.g. hard_header_len? min_header_len? needed_headroom?
> needed_tailroom? I'd worry that even if you cover existing ones more
> might be added with time.  A function copying config between devices
> probably belongs in some central place IMHO.

ok. will add a function that handles these fields too.

>
>
>> +
>> +	err = register_netdev(failover_dev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register failover_dev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(failover_dev);
>> +
>> +	err = failover_register(failover_dev, NULL, pfailover);
>> +	if (err < 0)
>> +		goto err_failover;
>> +
>> +	return 0;
>> +
>> +err_failover:
>> +	unregister_netdev(failover_dev);
>> +err_register_netdev:
>> +	free_netdev(failover_dev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_create);
>> +
>> +void failover_destroy(struct failover *failover)
>> +{
>> +	struct net_device *failover_dev;
>> +	struct net_device *slave_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (!failover)
>> +		return;
>> +
>> +	failover_dev = rcu_dereference(failover->failover_dev);
>> +	finfo = netdev_priv(failover_dev);
>> +
>> +	netif_device_detach(failover_dev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (slave_dev)
>> +		failover_slave_unregister(slave_dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (slave_dev)
>> +		failover_slave_unregister(slave_dev);
>> +
>> +	failover_unregister(failover);
>> +
>> +	unregister_netdevice(failover_dev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(failover_dev);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_destroy);
>> +
>> +static __init int
>> +failover_init(void)
>> +{
>> +	register_netdevice_notifier(&failover_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(failover_init);
>> +
>> +static __exit
>> +void failover_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&failover_notifier);
>> +}
>> +module_exit(failover_exit);
>> +
>> +MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-20 14:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: alexander.h.duyck, virtio-dev, kvm, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <060e2b5f-2e93-c53f-387b-5baaa33e87cd@redhat.com>

On Fri, Apr 20, 2018 at 11:52:47AM +0800, Jason Wang wrote:
> > The biggest issue is that you let userspace poke at the
> > device which is also allowed by the IOMMU to poke at
> > kernel memory (needed for kernel driver to work).
> 
> I don't quite get. The userspace driver could be built on top of VFIO for
> sure. So kernel memory were perfectly isolated in this case.

VFIO does what it can but it mostly just has the IOMMU to play with.
So don't overestimate what it can do - it assumes a high level
of spec compliance for protections to work. For example,
ATS is enabled by default if device has it, and that
treats translated requests are trusted. FLS is assumed to reset
the device for when VFIO is unbound from the device. etc.


> > 
> > Yes, maybe if device is not buggy it's all fine, but
> > it's better if we do not have to trust the device
> > otherwise the security picture becomes more murky.
> > 
> > I suggested attaching a PASID to (some) queues - see my old post "using
> > PASIDs to enable a safe variant of direct ring access".
> > 
> > Then using IOMMU with VFIO to limit access through queue to corrent
> > ranges of memory.
> 
> Well userspace driver could benefit from this too. And we can even go
> further by using nested IO page tables to share IOVA address space between
> devices and a VM.
> 
> Thanks

Yes I suggested this separately.

-- 
MST

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-20 13:52 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Wang, Xiao W,
	ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9511D5@SHSMSX104.ccr.corp.intel.com>

On Fri, Apr 20, 2018 at 03:50:41AM +0000, Liang, Cunming wrote:
> 
> 
> > -----Original Message-----
> > From: Bie, Tiwei
> > Sent: Friday, April 20, 2018 11:28 AM
> > To: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>; alex.williamson@redhat.com;
> > ddutile@redhat.com; Duyck, Alexander H <alexander.h.duyck@intel.com>;
> > virtio-dev@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> > kvm@vger.kernel.org; virtualization@lists.linux-foundation.org;
> > netdev@vger.kernel.org; Daly, Dan <dan.daly@intel.com>; Liang, Cunming
> > <cunming.liang@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Tan,
> > Jianfeng <jianfeng.tan@intel.com>; Wang, Xiao W <xiao.w.wang@intel.com>;
> > Tian, Kevin <kevin.tian@intel.com>
> > Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
> > 
> > On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > > One problem is that, different virtio ring compatible devices
> > > > > > > may have different device interfaces. That is to say, we will
> > > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > > in kernel.
> > > > > > So you just move the abstraction layer from qemu to kernel, and
> > > > > > you still need different drivers in kernel for different device
> > > > > > interfaces of accelerators. This looks even more complex than
> > > > > > leaving it in qemu. As you said, another idea is to implement
> > > > > > userspace vhost backend for accelerators which seems easier and
> > > > > > could co-work with other parts of qemu without inventing new type of
> > messages.
> > > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > > vendor specific hardware drivers in QEMU?
> > > > >
> > > >
> > > > I don't object but we need to figure out the advantages of doing it
> > > > in qemu too.
> > > >
> > > > Thanks
> > >
> > > To be frank kernel is exactly where device drivers belong.  DPDK did
> > > move them to userspace but that's merely a requirement for data path.
> > > *If* you can have them in kernel that is best:
> > > - update kernel and there's no need to rebuild userspace
> > > - apps can be written in any language no need to maintain multiple
> > >   libraries or add wrappers
> > > - security concerns are much smaller (ok people are trying to
> > >   raise the bar with IOMMUs and such, but it's already pretty
> > >   good even without)
> > >
> > > The biggest issue is that you let userspace poke at the device which
> > > is also allowed by the IOMMU to poke at kernel memory (needed for
> > > kernel driver to work).
> > 
> > I think the device won't and shouldn't be allowed to poke at kernel memory. Its
> > kernel driver needs some kernel memory to work. But the device doesn't have
> > the access to them. Instead, the device only has the access to:
> > 
> > (1) the entire memory of the VM (if vIOMMU isn't used) or
> > (2) the memory belongs to the guest virtio device (if
> >     vIOMMU is being used).
> > 
> > Below is the reason:
> > 
> > For the first case, we should program the IOMMU for the hardware device based
> > on the info in the memory table which is the entire memory of the VM.
> > 
> > For the second case, we should program the IOMMU for the hardware device
> > based on the info in the shadow page table of the vIOMMU.
> > 
> > So the memory can be accessed by the device is limited, it should be safe
> > especially for the second case.
> > 
> > My concern is that, in this RFC, we don't program the IOMMU for the mdev
> > device in the userspace via the VFIO API directly. Instead, we pass the memory
> > table to the kernel driver via the mdev device (BAR0) and ask the driver to do the
> > IOMMU programming. Someone may don't like it. The main reason why we don't
> > program IOMMU via VFIO API in userspace directly is that, currently IOMMU
> > drivers don't support mdev bus.
> > 
> > >
> > > Yes, maybe if device is not buggy it's all fine, but it's better if we
> > > do not have to trust the device otherwise the security picture becomes
> > > more murky.
> > >
> > > I suggested attaching a PASID to (some) queues - see my old post
> > > "using PASIDs to enable a safe variant of direct ring access".
> > 
> Ideally we can have a device binding with normal driver in host, meanwhile support to allocate a few queues attaching with PASID on-demand. By vhost mdev transport channel, the data path ability of queues(as a device) can expose to qemu vhost adaptor as a vDPA instance. Then we can avoid VF number limitation, providing vhost data path acceleration in a small granularity.

Exactly my point.

> > It's pretty cool. We also have some similar ideas.
> > Cunming will talk more about this.
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > >
> > > Then using IOMMU with VFIO to limit access through queue to corrent
> > > ranges of memory.
> > >
> > >
> > > --
> > > MST

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-20 13:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, Mikulas Patocka, bhutchings, Andrew Morton,
	David Miller, Vlastimil Babka
In-Reply-To: <20180420134136.GD10788@bombadil.infradead.org>

On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
> 
> I think it'll still suit Mikulas' debugging needs if we always use
> vmalloc for sizes above PAGE_SIZE?

Even if that was the case then this doesn't sounds like CONFIG_DEBUG_VM
material. We do not want a completely different behavior when the config
is enabled. If we really need some better fallback testing coverage
then the fault injection, as suggested by Vlastimil, sounds much more
reasonable to me

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Matthew Wilcox @ 2018-04-20 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, Mikulas Patocka, bhutchings, Andrew Morton,
	David Miller, Vlastimil Babka
In-Reply-To: <20180420130852.GC16083@dhcp22.suse.cz>

On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> No way. This is just wrong! First of all, you will explode most likely
> on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> enabled quite often.

I think it'll still suit Mikulas' debugging needs if we always use
vmalloc for sizes above PAGE_SIZE?

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-20 13:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191207380.31175@file01.intranet.prod.int.rdu2.redhat.com>

On Thu 19-04-18 12:12:38, Mikulas Patocka wrote:
[...]
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
> 
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.
> 
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.

No way. This is just wrong! First of all, you will explode most likely
on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
enabled quite often.

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Nacked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/util.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux-2.6/mm/util.c
> ===================================================================
> --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	 */
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
> +#endif
>  
>  	return __vmalloc_node_flags_caller(size, node, flags,
>  			__builtin_return_address(0));

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-20 12:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, Andrew Morton, David Miller, Vlastimil Babka
In-Reply-To: <20180420114712.GB10788@bombadil.infradead.org>



On Fri, 20 Apr 2018, Matthew Wilcox wrote:

> On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> 
> Maybe it's time to have the SG code handle vmalloced pages?  This is
> becoming more and more common with vmapped stacks (and some of our
> workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
> DMA onto the stack any more?).  We already have a few places which do
> handle sgs of vmalloced addresses, such as the nx crypto driver:
> 
>         if (is_vmalloc_addr(start_addr))
>                 sg_addr = page_to_phys(vmalloc_to_page(start_addr))
>                           + offset_in_page(sg_addr);
>         else
>                 sg_addr = __pa(sg_addr);
> 
> and videobuf:
> 
>                 pg = vmalloc_to_page(virt);
>                 if (NULL == pg)
>                         goto err;
>                 BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
>                 sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> 
> Yes, there's the potential that we have to produce two SG entries for a
> virtually contiguous region if it crosses a page boundary, and our APIs
> aren't set up right to make it happen.  But this is something we should
> consider fixing ... otherwise we'll end up with dozens of driver hacks.
> The videobuf implementation was already copy-and-pasted into the saa7146
> driver, for example.

What if the device requires physically contiguous area and the vmalloc 
area crosses a page? Will you use a bounce buffer? Where do you allocate 
the bounce buffer from? What if you run out of bounce buffers?

Mikulkas

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-20 12:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <20180419162250.00bf82e2c40b4960a7e23cdc@linux-foundation.org>



On Thu, 19 Apr 2018, Andrew Morton wrote:

> On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > > > In order to detect these bugs reliably I submit this patch that changes
> > > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > > 
> > > > ...
> > > >
> > > > --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> > > > +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> > > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > > >   */
> > > >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > > >  {
> > > > +#ifndef CONFIG_DEBUG_VM
> > > >  	gfp_t kmalloc_flags = flags;
> > > >  	void *ret;
> > > >  
> > > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > >  	 */
> > > >  	if (ret || size <= PAGE_SIZE)
> > > >  		return ret;
> > > > +#endif
> > > >  
> > > >  	return __vmalloc_node_flags_caller(size, node, flags,
> > > >  			__builtin_return_address(0));
> > > 
> > > Well, it doesn't have to be done at compile-time, does it?  We could
> > > add a knob (in debugfs, presumably) which enables this at runtime. 
> > > That's far more user-friendly.
> > 
> > But who will turn it on in debugfs?
> 
> But who will turn it on in Kconfig?  Just a handful of developers.  We

So, it won't receive much testing.

I've never played with those debugfs files (because I didn't need it), and 
most users also won't play with it. Having a debugfs option is like having 
no option at all.

> could add SONFIG_DEBUG_SG to the list in
> Documentation/process/submit-checklist.rst, but nobody reads that.
> 
> Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
> googling indicates that they aren't the only ones...
> 
> > It should be default for debugging 
> > kernels, so that users using them would report the error.
> 
> Well.  This isn't the first time we've wanted to enable expensive (or
> noisy) debugging things in debug kernels, by any means.
> 
> So how could we define a debug kernel in which it's OK to enable such
> things?

Debug kernel is what distributions distribute as debug kernel - i.e. RHEL 
or Fedora have debugging kernels. So it needs to be bound to an option 
that is turned on in these kernels - so that any user who boots the 
debugging kernel triggers the bug.

> - Could be "it's an -rc kernel".  But then we'd be enabling a bunch of
>   untested code when Linus cuts a release.
> 
> - Could be "it's an -rc kernel with SUBLEVEL <= 5".  But then we risk
>   unexpected things happening when Linux cuts -rc6, which still isn't
>   good.
> 
> - How about "it's an -rc kernel with odd-numbered SUBLEVEL and
>   SUBLEVEL <= 5".  That way everybody who runs -rc1, -rc3 and -rc5 will
>   have kvmalloc debugging enabled.  That's potentially nasty because
>   vmalloc is much slower than kmalloc.  But kvmalloc() is only used for
>   large and probably infrequent allocations, so it's probably OK.
> 
> I wonder how we get at SUBLEVEL from within .c.  

Don't bind it to rc level, bind it to some debugging configuration option.

Mikulas

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Matthew Wilcox @ 2018-04-20 11:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191207380.31175@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.

Maybe it's time to have the SG code handle vmalloced pages?  This is
becoming more and more common with vmapped stacks (and some of our
workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
DMA onto the stack any more?).  We already have a few places which do
handle sgs of vmalloced addresses, such as the nx crypto driver:

        if (is_vmalloc_addr(start_addr))
                sg_addr = page_to_phys(vmalloc_to_page(start_addr))
                          + offset_in_page(sg_addr);
        else
                sg_addr = __pa(sg_addr);

and videobuf:

                pg = vmalloc_to_page(virt);
                if (NULL == pg)
                        goto err;
                BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
                sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);

Yes, there's the potential that we have to produce two SG entries for a
virtually contiguous region if it crosses a page boundary, and our APIs
aren't set up right to make it happen.  But this is something we should
consider fixing ... otherwise we'll end up with dozens of driver hacks.
The videobuf implementation was already copy-and-pasted into the saa7146
driver, for example.

^ permalink raw reply

* Re: [PATCH] drm/virtio: fix vq wait_event condition
From: Dave Airlie @ 2018-04-20 10:12 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Dave Airlie, open list, stable, open list:VIRTIO GPU DRIVER,
	dri-devel, amagloire
In-Reply-To: <20180420072228.uorb3h7snx4io6xs@sirius.home.kraxel.org>


[-- Attachment #1.1: Type: text/plain, Size: 2201 bytes --]

Reviewed-by: Dave Airlie <airlied@redhat.com>

On Fri., 20 Apr. 2018, 17:23 Gerd Hoffmann, <kraxel@redhat.com> wrote:

> On Tue, Apr 03, 2018 at 11:59:04AM +0200, Gerd Hoffmann wrote:
> > Wait until we have enough space in the virt queue to actually queue up
> > our request.  Avoids the guest spinning in case we have a non-zero
> > amount of free entries but not enough for the request.
>
> Ping airlied, can you please either pick it up or review so I can commit
> myself?
>
> thanks,
>   Gerd
>
> > Cc: stable@vger.kernel.org
> > Reported-by: Alain Magloire <amagloire@blackberry.com>
> > Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> > ---
> >  drivers/gpu/drm/virtio/virtgpu_vq.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c
> b/drivers/gpu/drm/virtio/virtgpu_vq.c
> > index 48e4f1df6e..020070d483 100644
> > --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> > +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> > @@ -293,7 +293,7 @@ static int
> virtio_gpu_queue_ctrl_buffer_locked(struct virtio_gpu_device *vgdev,
> >       ret = virtqueue_add_sgs(vq, sgs, outcnt, incnt, vbuf, GFP_ATOMIC);
> >       if (ret == -ENOSPC) {
> >               spin_unlock(&vgdev->ctrlq.qlock);
> > -             wait_event(vgdev->ctrlq.ack_queue, vq->num_free);
> > +             wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= outcnt
> + incnt);
> >               spin_lock(&vgdev->ctrlq.qlock);
> >               goto retry;
> >       } else {
> > @@ -368,7 +368,7 @@ static int virtio_gpu_queue_cursor(struct
> virtio_gpu_device *vgdev,
> >       ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
> >       if (ret == -ENOSPC) {
> >               spin_unlock(&vgdev->cursorq.qlock);
> > -             wait_event(vgdev->cursorq.ack_queue, vq->num_free);
> > +             wait_event(vgdev->cursorq.ack_queue, vq->num_free >=
> outcnt);
> >               spin_lock(&vgdev->cursorq.qlock);
> >               goto retry;
> >       } else {
> > --
> > 2.9.3
> >
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

[-- Attachment #1.2: Type: text/html, Size: 3378 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v2 4/4] qxl: drop dummy functions
From: Daniel Vetter @ 2018-04-20  8:14 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, David Airlie, open list, dri-devel,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-5-kraxel@redhat.com>

On Fri, Apr 20, 2018 at 09:19:04AM +0200, Gerd Hoffmann wrote:
> These days drm core checks function pointers everywhere before calling
> them.  So we can drop a bunch of dummy functions now.
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  drivers/gpu/drm/qxl/qxl_display.c | 50 ---------------------------------------
>  1 file changed, 50 deletions(-)
> 
> diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
> index 06ee00b486..af6e52af5f 100644
> --- a/drivers/gpu/drm/qxl/qxl_display.c
> +++ b/drivers/gpu/drm/qxl/qxl_display.c
> @@ -456,13 +456,6 @@ qxl_framebuffer_init(struct drm_device *dev,
>  	return 0;
>  }
>  
> -static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
> -				  const struct drm_display_mode *mode,
> -				  struct drm_display_mode *adjusted_mode)
> -{
> -	return true;
> -}
> -
>  static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
>  				   struct drm_crtc_state *old_state)
>  {
> @@ -476,7 +469,6 @@ static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
>  }
>  
>  static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
> -	.mode_fixup = qxl_crtc_mode_fixup,
>  	.atomic_flush = qxl_crtc_atomic_flush,
>  	.atomic_enable = qxl_crtc_atomic_enable,
>  	.atomic_disable = qxl_crtc_atomic_disable,
> @@ -620,12 +612,6 @@ static void qxl_primary_atomic_disable(struct drm_plane *plane,
>  	}
>  }
>  
> -static int qxl_plane_atomic_check(struct drm_plane *plane,
> -				  struct drm_plane_state *state)
> -{
> -	return 0;
> -}
> -
>  static void qxl_cursor_atomic_update(struct drm_plane *plane,
>  				     struct drm_plane_state *old_state)
>  {
> @@ -831,7 +817,6 @@ static const uint32_t qxl_cursor_plane_formats[] = {
>  };
>  
>  static const struct drm_plane_helper_funcs qxl_cursor_helper_funcs = {
> -	.atomic_check = qxl_plane_atomic_check,
>  	.atomic_update = qxl_cursor_atomic_update,
>  	.atomic_disable = qxl_cursor_atomic_disable,
>  	.prepare_fb = qxl_plane_prepare_fb,
> @@ -956,28 +941,6 @@ static int qdev_crtc_init(struct drm_device *dev, int crtc_id)
>  	return r;
>  }
>  
> -static void qxl_enc_dpms(struct drm_encoder *encoder, int mode)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_prepare(struct drm_encoder *encoder)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_commit(struct drm_encoder *encoder)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
> -static void qxl_enc_mode_set(struct drm_encoder *encoder,
> -				struct drm_display_mode *mode,
> -				struct drm_display_mode *adjusted_mode)
> -{
> -	DRM_DEBUG("\n");
> -}
> -
>  static int qxl_conn_get_modes(struct drm_connector *connector)
>  {
>  	unsigned pwidth = 1024;
> @@ -1023,10 +986,6 @@ static struct drm_encoder *qxl_best_encoder(struct drm_connector *connector)
>  
>  
>  static const struct drm_encoder_helper_funcs qxl_enc_helper_funcs = {

Hm, I thought even the vtable itself is optional? We do have tons of if
(!funcs) checks all over the place at least.


> -	.dpms = qxl_enc_dpms,
> -	.prepare = qxl_enc_prepare,
> -	.mode_set = qxl_enc_mode_set,
> -	.commit = qxl_enc_commit,
>  };
>  
>  static const struct drm_connector_helper_funcs qxl_connector_helper_funcs = {
> @@ -1059,14 +1018,6 @@ static enum drm_connector_status qxl_conn_detect(
>  			 : connector_status_disconnected;
>  }
>  
> -static int qxl_conn_set_property(struct drm_connector *connector,
> -				   struct drm_property *property,
> -				   uint64_t value)
> -{
> -	DRM_DEBUG("\n");
> -	return 0;
> -}
> -
>  static void qxl_conn_destroy(struct drm_connector *connector)
>  {
>  	struct qxl_output *qxl_output =
> @@ -1081,7 +1032,6 @@ static const struct drm_connector_funcs qxl_connector_funcs = {
>  	.dpms = drm_helper_connector_dpms,
>  	.detect = qxl_conn_detect,
>  	.fill_modes = drm_helper_probe_single_connector_modes,
> -	.set_property = qxl_conn_set_property,

This is a legacy callback anyway. I kinda wonder whether we should have a
bunch of WARN_ON checks for this, so that atomic drivers don't have these
hanging around anymore.

Anyway, patch looks pretty.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>  	.destroy = qxl_conn_destroy,
>  	.reset = drm_atomic_helper_connector_reset,
>  	.atomic_duplicate_state = drm_atomic_helper_connector_duplicate_state,
> -- 
> 2.9.3
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH 1/2] qxl: fix qxl_release_{map,unmap}
From: Daniel Vetter @ 2018-04-20  7:48 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, Dave Airlie, open list, dri-devel,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180418054257.15388-2-kraxel@redhat.com>

On Wed, Apr 18, 2018 at 07:42:56AM +0200, Gerd Hoffmann wrote:
> s/PAGE_SIZE/PAGE_MASK/
> 
> Luckily release_offset is never larger than PAGE_SIZE, so the bug has no
> bad side effects and managed to stay unnoticed for years that way ...
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Sweeet. Since the buggy code uses the same expression for page frame and
offset I don't think there's a security bug. You might still want to cc:
stable (since without you defacto can't ever use this feature).

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/qxl/qxl_ioctl.c   | 4 ++--
>  drivers/gpu/drm/qxl/qxl_release.c | 6 +++---
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/drivers/gpu/drm/qxl/qxl_ioctl.c
> index e238a1a2ec..6cc9f3367f 100644
> --- a/drivers/gpu/drm/qxl/qxl_ioctl.c
> +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
> @@ -182,9 +182,9 @@ static int qxl_process_single_command(struct qxl_device *qdev,
>  		goto out_free_reloc;
>  
>  	/* TODO copy slow path code from i915 */
> -	fb_cmd = qxl_bo_kmap_atomic_page(qdev, cmd_bo, (release->release_offset & PAGE_SIZE));
> +	fb_cmd = qxl_bo_kmap_atomic_page(qdev, cmd_bo, (release->release_offset & PAGE_MASK));
>  	unwritten = __copy_from_user_inatomic_nocache
> -		(fb_cmd + sizeof(union qxl_release_info) + (release->release_offset & ~PAGE_SIZE),
> +		(fb_cmd + sizeof(union qxl_release_info) + (release->release_offset & ~PAGE_MASK),
>  		 u64_to_user_ptr(cmd->command), cmd->command_size);
>  
>  	{
> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> index 5d84a66fed..a0b4244d28 100644
> --- a/drivers/gpu/drm/qxl/qxl_release.c
> +++ b/drivers/gpu/drm/qxl/qxl_release.c
> @@ -411,10 +411,10 @@ union qxl_release_info *qxl_release_map(struct qxl_device *qdev,
>  	struct qxl_bo_list *entry = list_first_entry(&release->bos, struct qxl_bo_list, tv.head);
>  	struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);
>  
> -	ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_SIZE);
> +	ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_MASK);
>  	if (!ptr)
>  		return NULL;
> -	info = ptr + (release->release_offset & ~PAGE_SIZE);
> +	info = ptr + (release->release_offset & ~PAGE_MASK);
>  	return info;
>  }
>  
> @@ -426,7 +426,7 @@ void qxl_release_unmap(struct qxl_device *qdev,
>  	struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);
>  	void *ptr;
>  
> -	ptr = ((void *)info) - (release->release_offset & ~PAGE_SIZE);
> +	ptr = ((void *)info) - (release->release_offset & ~PAGE_MASK);
>  	qxl_bo_kunmap_atomic_page(qdev, bo, ptr);
>  }
>  
> -- 
> 2.9.3
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH] drm/virtio: fix vq wait_event condition
From: Gerd Hoffmann @ 2018-04-20  7:22 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, open list, amagloire, stable,
	open list:VIRTIO GPU DRIVER
In-Reply-To: <20180403095904.11152-1-kraxel@redhat.com>

On Tue, Apr 03, 2018 at 11:59:04AM +0200, Gerd Hoffmann wrote:
> Wait until we have enough space in the virt queue to actually queue up
> our request.  Avoids the guest spinning in case we have a non-zero
> amount of free entries but not enough for the request.

Ping airlied, can you please either pick it up or review so I can commit
myself?

thanks,
  Gerd

> Cc: stable@vger.kernel.org
> Reported-by: Alain Magloire <amagloire@blackberry.com>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  drivers/gpu/drm/virtio/virtgpu_vq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
> index 48e4f1df6e..020070d483 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> @@ -293,7 +293,7 @@ static int virtio_gpu_queue_ctrl_buffer_locked(struct virtio_gpu_device *vgdev,
>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, incnt, vbuf, GFP_ATOMIC);
>  	if (ret == -ENOSPC) {
>  		spin_unlock(&vgdev->ctrlq.qlock);
> -		wait_event(vgdev->ctrlq.ack_queue, vq->num_free);
> +		wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= outcnt + incnt);
>  		spin_lock(&vgdev->ctrlq.qlock);
>  		goto retry;
>  	} else {
> @@ -368,7 +368,7 @@ static int virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
>  	if (ret == -ENOSPC) {
>  		spin_unlock(&vgdev->cursorq.qlock);
> -		wait_event(vgdev->cursorq.ack_queue, vq->num_free);
> +		wait_event(vgdev->cursorq.ack_queue, vq->num_free >= outcnt);
>  		spin_lock(&vgdev->cursorq.qlock);
>  		goto retry;
>  	} else {
> -- 
> 2.9.3
> 

^ permalink raw reply

* [PATCH v2 4/4] qxl: drop dummy functions
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

These days drm core checks function pointers everywhere before calling
them.  So we can drop a bunch of dummy functions now.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_display.c | 50 ---------------------------------------
 1 file changed, 50 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index 06ee00b486..af6e52af5f 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -456,13 +456,6 @@ qxl_framebuffer_init(struct drm_device *dev,
 	return 0;
 }
 
-static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
-				  const struct drm_display_mode *mode,
-				  struct drm_display_mode *adjusted_mode)
-{
-	return true;
-}
-
 static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
 				   struct drm_crtc_state *old_state)
 {
@@ -476,7 +469,6 @@ static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
 }
 
 static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
-	.mode_fixup = qxl_crtc_mode_fixup,
 	.atomic_flush = qxl_crtc_atomic_flush,
 	.atomic_enable = qxl_crtc_atomic_enable,
 	.atomic_disable = qxl_crtc_atomic_disable,
@@ -620,12 +612,6 @@ static void qxl_primary_atomic_disable(struct drm_plane *plane,
 	}
 }
 
-static int qxl_plane_atomic_check(struct drm_plane *plane,
-				  struct drm_plane_state *state)
-{
-	return 0;
-}
-
 static void qxl_cursor_atomic_update(struct drm_plane *plane,
 				     struct drm_plane_state *old_state)
 {
@@ -831,7 +817,6 @@ static const uint32_t qxl_cursor_plane_formats[] = {
 };
 
 static const struct drm_plane_helper_funcs qxl_cursor_helper_funcs = {
-	.atomic_check = qxl_plane_atomic_check,
 	.atomic_update = qxl_cursor_atomic_update,
 	.atomic_disable = qxl_cursor_atomic_disable,
 	.prepare_fb = qxl_plane_prepare_fb,
@@ -956,28 +941,6 @@ static int qdev_crtc_init(struct drm_device *dev, int crtc_id)
 	return r;
 }
 
-static void qxl_enc_dpms(struct drm_encoder *encoder, int mode)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_prepare(struct drm_encoder *encoder)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_commit(struct drm_encoder *encoder)
-{
-	DRM_DEBUG("\n");
-}
-
-static void qxl_enc_mode_set(struct drm_encoder *encoder,
-				struct drm_display_mode *mode,
-				struct drm_display_mode *adjusted_mode)
-{
-	DRM_DEBUG("\n");
-}
-
 static int qxl_conn_get_modes(struct drm_connector *connector)
 {
 	unsigned pwidth = 1024;
@@ -1023,10 +986,6 @@ static struct drm_encoder *qxl_best_encoder(struct drm_connector *connector)
 
 
 static const struct drm_encoder_helper_funcs qxl_enc_helper_funcs = {
-	.dpms = qxl_enc_dpms,
-	.prepare = qxl_enc_prepare,
-	.mode_set = qxl_enc_mode_set,
-	.commit = qxl_enc_commit,
 };
 
 static const struct drm_connector_helper_funcs qxl_connector_helper_funcs = {
@@ -1059,14 +1018,6 @@ static enum drm_connector_status qxl_conn_detect(
 			 : connector_status_disconnected;
 }
 
-static int qxl_conn_set_property(struct drm_connector *connector,
-				   struct drm_property *property,
-				   uint64_t value)
-{
-	DRM_DEBUG("\n");
-	return 0;
-}
-
 static void qxl_conn_destroy(struct drm_connector *connector)
 {
 	struct qxl_output *qxl_output =
@@ -1081,7 +1032,6 @@ static const struct drm_connector_funcs qxl_connector_funcs = {
 	.dpms = drm_helper_connector_dpms,
 	.detect = qxl_conn_detect,
 	.fill_modes = drm_helper_probe_single_connector_modes,
-	.set_property = qxl_conn_set_property,
 	.destroy = qxl_conn_destroy,
 	.reset = drm_atomic_helper_connector_reset,
 	.atomic_duplicate_state = drm_atomic_helper_connector_duplicate_state,
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 3/4] qxl: hook monitors_config updates into crtc, not encoder.
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

The encoder callbacks are only called in case the video mode changes.
So any layout changes without mode changes will go unnoticed.

Add qxl_crtc_update_monitors_config(), based on the old
qxl_write_monitors_config_for_encoder() function.  Hook it into the
enable, disable and flush atomic crtc callbacks.  Remove monitors_config
updates from all other places.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1544322
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_cmd.c     |   2 +
 drivers/gpu/drm/qxl/qxl_display.c | 156 ++++++++++++++++----------------------
 2 files changed, 66 insertions(+), 92 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_cmd.c b/drivers/gpu/drm/qxl/qxl_cmd.c
index 850f8d7d37..95db20f214 100644
--- a/drivers/gpu/drm/qxl/qxl_cmd.c
+++ b/drivers/gpu/drm/qxl/qxl_cmd.c
@@ -371,6 +371,7 @@ void qxl_io_flush_surfaces(struct qxl_device *qdev)
 void qxl_io_destroy_primary(struct qxl_device *qdev)
 {
 	wait_for_io_cmd(qdev, 0, QXL_IO_DESTROY_PRIMARY_ASYNC);
+	qdev->primary_created = false;
 }
 
 void qxl_io_create_primary(struct qxl_device *qdev,
@@ -396,6 +397,7 @@ void qxl_io_create_primary(struct qxl_device *qdev,
 	create->type = QXL_SURF_TYPE_PRIMARY;
 
 	wait_for_io_cmd(qdev, 0, QXL_IO_CREATE_PRIMARY_ASYNC);
+	qdev->primary_created = true;
 }
 
 void qxl_io_memslot_add(struct qxl_device *qdev, uint8_t id)
diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index 68db4d1965..06ee00b486 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -281,6 +281,66 @@ static void qxl_send_monitors_config(struct qxl_device *qdev)
 	qxl_io_monitors_config(qdev);
 }
 
+static void qxl_crtc_update_monitors_config(struct drm_crtc *crtc,
+					    const char *reason)
+{
+	struct drm_device *dev = crtc->dev;
+	struct qxl_device *qdev = dev->dev_private;
+	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
+	struct qxl_head head;
+	int oldcount, i = qcrtc->index;
+
+	if (!qdev->primary_created) {
+		DRM_DEBUG_KMS("no primary surface, skip (%s)\n", reason);
+		return;
+	}
+
+	if (!qdev->monitors_config ||
+	    qdev->monitors_config->max_allowed <= i)
+		return;
+
+	head.id = i;
+	head.flags = 0;
+	oldcount = qdev->monitors_config->count;
+	if (crtc->state->active) {
+		struct drm_display_mode *mode = &crtc->mode;
+		head.width = mode->hdisplay;
+		head.height = mode->vdisplay;
+		head.x = crtc->x;
+		head.y = crtc->y;
+		if (qdev->monitors_config->count < i + 1)
+			qdev->monitors_config->count = i + 1;
+	} else if (i > 0) {
+		head.width = 0;
+		head.height = 0;
+		head.x = 0;
+		head.y = 0;
+		if (qdev->monitors_config->count == i + 1)
+			qdev->monitors_config->count = i;
+	} else {
+		DRM_DEBUG_KMS("inactive head 0, skip (%s)\n", reason);
+		return;
+	}
+
+	if (head.width  == qdev->monitors_config->heads[i].width  &&
+	    head.height == qdev->monitors_config->heads[i].height &&
+	    head.x      == qdev->monitors_config->heads[i].x      &&
+	    head.y      == qdev->monitors_config->heads[i].y      &&
+	    oldcount    == qdev->monitors_config->count)
+		return;
+
+	DRM_DEBUG_KMS("head %d, %dx%d, at +%d+%d, %s (%s)\n",
+		      i, head.width, head.height, head.x, head.y,
+		      crtc->state->active ? "on" : "off", reason);
+	if (oldcount != qdev->monitors_config->count)
+		DRM_DEBUG_KMS("active heads %d -> %d (%d total)\n",
+			      oldcount, qdev->monitors_config->count,
+			      qdev->monitors_config->max_allowed);
+
+	qdev->monitors_config->heads[i] = head;
+	qxl_send_monitors_config(qdev);
+}
+
 static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 				  struct drm_crtc_state *old_crtc_state)
 {
@@ -296,6 +356,8 @@ static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 		drm_crtc_send_vblank_event(crtc, event);
 		spin_unlock_irqrestore(&dev->event_lock, flags);
 	}
+
+	qxl_crtc_update_monitors_config(crtc, "flush");
 }
 
 static void qxl_crtc_destroy(struct drm_crtc *crtc)
@@ -401,55 +463,20 @@ static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
 	return true;
 }
 
-static void qxl_monitors_config_set(struct qxl_device *qdev,
-				    int index,
-				    unsigned x, unsigned y,
-				    unsigned width, unsigned height,
-				    unsigned surf_id)
-{
-	DRM_DEBUG_KMS("%d:%dx%d+%d+%d\n", index, width, height, x, y);
-	qdev->monitors_config->heads[index].x = x;
-	qdev->monitors_config->heads[index].y = y;
-	qdev->monitors_config->heads[index].width = width;
-	qdev->monitors_config->heads[index].height = height;
-	qdev->monitors_config->heads[index].surface_id = surf_id;
-
-}
-
-static void qxl_mode_set_nofb(struct drm_crtc *crtc)
-{
-	struct qxl_device *qdev = crtc->dev->dev_private;
-	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
-	struct drm_display_mode *mode = &crtc->mode;
-
-	DRM_DEBUG("Mode set (%d,%d)\n",
-		  mode->hdisplay, mode->vdisplay);
-
-	qxl_monitors_config_set(qdev, qcrtc->index, 0, 0,
-				mode->hdisplay,	mode->vdisplay, 0);
-
-}
-
 static void qxl_crtc_atomic_enable(struct drm_crtc *crtc,
 				   struct drm_crtc_state *old_state)
 {
-	DRM_DEBUG("\n");
+	qxl_crtc_update_monitors_config(crtc, "enable");
 }
 
 static void qxl_crtc_atomic_disable(struct drm_crtc *crtc,
 				    struct drm_crtc_state *old_state)
 {
-	struct qxl_crtc *qcrtc = to_qxl_crtc(crtc);
-	struct qxl_device *qdev = crtc->dev->dev_private;
-
-	qxl_monitors_config_set(qdev, qcrtc->index, 0, 0, 0, 0, 0);
-
-	qxl_send_monitors_config(qdev);
+	qxl_crtc_update_monitors_config(crtc, "disable");
 }
 
 static const struct drm_crtc_helper_funcs qxl_crtc_helper_funcs = {
 	.mode_fixup = qxl_crtc_mode_fixup,
-	.mode_set_nofb = qxl_mode_set_nofb,
 	.atomic_flush = qxl_crtc_atomic_flush,
 	.atomic_enable = qxl_crtc_atomic_enable,
 	.atomic_disable = qxl_crtc_atomic_disable,
@@ -939,61 +966,8 @@ static void qxl_enc_prepare(struct drm_encoder *encoder)
 	DRM_DEBUG("\n");
 }
 
-static void qxl_write_monitors_config_for_encoder(struct qxl_device *qdev,
-		struct drm_encoder *encoder)
-{
-	int i;
-	struct qxl_output *output = drm_encoder_to_qxl_output(encoder);
-	struct qxl_head *head;
-	struct drm_display_mode *mode;
-
-	BUG_ON(!encoder);
-	/* TODO: ugly, do better */
-	i = output->index;
-	if (!qdev->monitors_config ||
-	    qdev->monitors_config->max_allowed <= i) {
-		DRM_ERROR(
-		"head number too large or missing monitors config: %p, %d",
-		qdev->monitors_config,
-		qdev->monitors_config ?
-			qdev->monitors_config->max_allowed : -1);
-		return;
-	}
-	if (!encoder->crtc) {
-		DRM_ERROR("missing crtc on encoder %p\n", encoder);
-		return;
-	}
-	if (i != 0)
-		DRM_DEBUG("missing for multiple monitors: no head holes\n");
-	head = &qdev->monitors_config->heads[i];
-	head->id = i;
-	if (encoder->crtc->enabled) {
-		mode = &encoder->crtc->mode;
-		head->width = mode->hdisplay;
-		head->height = mode->vdisplay;
-		head->x = encoder->crtc->x;
-		head->y = encoder->crtc->y;
-		if (qdev->monitors_config->count < i + 1)
-			qdev->monitors_config->count = i + 1;
-	} else {
-		head->width = 0;
-		head->height = 0;
-		head->x = 0;
-		head->y = 0;
-	}
-	DRM_DEBUG_KMS("setting head %d to +%d+%d %dx%d out of %d\n",
-		      i, head->x, head->y, head->width, head->height, qdev->monitors_config->count);
-	head->flags = 0;
-	/* TODO - somewhere else to call this for multiple monitors
-	 * (config_commit?) */
-	qxl_send_monitors_config(qdev);
-}
-
 static void qxl_enc_commit(struct drm_encoder *encoder)
 {
-	struct qxl_device *qdev = encoder->dev->dev_private;
-
-	qxl_write_monitors_config_for_encoder(qdev, encoder);
 	DRM_DEBUG("\n");
 }
 
@@ -1080,8 +1054,6 @@ static enum drm_connector_status qxl_conn_detect(
 		     qxl_head_enabled(&qdev->client_monitors_config->heads[output->index]);
 
 	DRM_DEBUG("#%d connected: %d\n", output->index, connected);
-	if (!connected)
-		qxl_monitors_config_set(qdev, output->index, 0, 0, 0, 0, 0);
 
 	return connected ? connector_status_connected
 			 : connector_status_disconnected;
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 2/4] qxl: move qxl_send_monitors_config()
From: Gerd Hoffmann @ 2018-04-20  7:19 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, David Airlie, open list,
	open list:DRM DRIVER FOR QXL VIRTUAL GPU
In-Reply-To: <20180420071904.24276-1-kraxel@redhat.com>

Needed to avoid a forward declaration in a followup patch.
Pure code move, no functional change.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/qxl/qxl_display.c | 47 +++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c
index dea757ef62..68db4d1965 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -258,6 +258,29 @@ static int qxl_add_common_modes(struct drm_connector *connector,
 	return i - 1;
 }
 
+static void qxl_send_monitors_config(struct qxl_device *qdev)
+{
+	int i;
+
+	BUG_ON(!qdev->ram_header->monitors_config);
+
+	if (qdev->monitors_config->count == 0)
+		return;
+
+	for (i = 0 ; i < qdev->monitors_config->count ; ++i) {
+		struct qxl_head *head = &qdev->monitors_config->heads[i];
+
+		if (head->y > 8192 || head->x > 8192 ||
+		    head->width > 8192 || head->height > 8192) {
+			DRM_ERROR("head %d wrong: %dx%d+%d+%d\n",
+				  i, head->width, head->height,
+				  head->x, head->y);
+			return;
+		}
+	}
+	qxl_io_monitors_config(qdev);
+}
+
 static void qxl_crtc_atomic_flush(struct drm_crtc *crtc,
 				  struct drm_crtc_state *old_crtc_state)
 {
@@ -378,30 +401,6 @@ static bool qxl_crtc_mode_fixup(struct drm_crtc *crtc,
 	return true;
 }
 
-static void
-qxl_send_monitors_config(struct qxl_device *qdev)
-{
-	int i;
-
-	BUG_ON(!qdev->ram_header->monitors_config);
-
-	if (qdev->monitors_config->count == 0)
-		return;
-
-	for (i = 0 ; i < qdev->monitors_config->count ; ++i) {
-		struct qxl_head *head = &qdev->monitors_config->heads[i];
-
-		if (head->y > 8192 || head->x > 8192 ||
-		    head->width > 8192 || head->height > 8192) {
-			DRM_ERROR("head %d wrong: %dx%d+%d+%d\n",
-				  i, head->width, head->height,
-				  head->x, head->y);
-			return;
-		}
-	}
-	qxl_io_monitors_config(qdev);
-}
-
 static void qxl_monitors_config_set(struct qxl_device *qdev,
 				    int index,
 				    unsigned x, unsigned y,
-- 
2.9.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox