Linux virtualization list
* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 21:26 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---

Thanks for doing this.  Your current version has a couple of show-stopper
issues.

First, the master device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to work around this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely averse to this but it needs to be fast, simple
and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Samudrala, Sridhar @ 2018-04-10 22:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem

On 4/10/2018 2:26 PM, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
> Thanks for doing this.  Your current version has couple show stopper
> issues.
>
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.


>
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.

netvsc should not be using bypass_select_queue(), as that ndo op gets used
only with the 3-netdev model.
Anyway, will look into updating bypass_select_queue() based on your fix.
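
For reference, the idea of deferring queue selection to the VF can be modelled in user space roughly as follows. This is only a sketch, not kernel code: struct model_netdev, struct model_skb, model_pick_tx() and the demo hook are hypothetical stand-ins, and folding the VF's answer back into the master's queue range is an assumption of this model. Queue counts are assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only a flow hash. */
struct model_skb {
	uint32_t hash;
};

/* Hypothetical stand-in for struct net_device. */
struct model_netdev {
	unsigned int real_num_tx_queues;	/* assumed non-zero */
	/* Optional driver hook, standing in for ndo_select_queue. */
	uint16_t (*select_queue)(struct model_netdev *dev,
				 struct model_skb *skb);
};

/* Generic fallback: spread flows across queues by hash. */
static uint16_t model_pick_tx(struct model_netdev *dev, struct model_skb *skb)
{
	return (uint16_t)(skb->hash % dev->real_num_tx_queues);
}

/* Let the VF choose the queue when a VF is present. */
static uint16_t model_master_select_queue(struct model_netdev *master,
					  struct model_netdev *vf,
					  struct model_skb *skb)
{
	uint16_t txq;

	if (vf) {
		if (vf->select_queue)
			txq = vf->select_queue(vf, skb);
		else
			txq = model_pick_tx(vf, skb);
	} else {
		txq = model_pick_tx(master, skb);
	}

	/* Assumption: keep the result inside the master's queue range. */
	return (uint16_t)(txq % master->real_num_tx_queues);
}

/* Example VF hook that always picks the VF's last queue. */
static uint16_t demo_vf_select_queue(struct model_netdev *dev,
				     struct model_skb *skb)
{
	(void)skb;
	return (uint16_t)(dev->real_num_tx_queues - 1);
}
```

The point of the model is only the ordering: ask the VF first, and fall back to the master's own hashing when no VF is attached.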

>
> Lastly, more indirection is bad in current climate.
>
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

Not sure we can avoid this indirection if we want to commonize the code but use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_by_ref() that go through all the devices for all the netdev events.
netvsc lookups should be much faster.

Thanks
Sridhar


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Michael S. Tsirkin @ 2018-04-10 23:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.
> 
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Siwei Liu @ 2018-04-10 23:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller

On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sigh, lots of hacks. Why propagate this from a driver to a common
module? We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 23:59 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Netdev,
	David Miller

On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propgating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Michael S. Tsirkin @ 2018-04-11  1:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.
> 
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.

Well right now netvsc does an indirect call to the PT device,
does it not? If you really want max performance when PT
is in use you need to do the reverse and have PT forward to netvsc.

> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11  7:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem

Wed, Apr 11, 2018 at 01:28:51AM CEST, mst@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>> 
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> 
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>> 
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?

Yeah. That is scary. Also, wrong.


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>> 
>> Lastly, more indirection is bad in current climate.
>> 
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11  7:53 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem

Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>
>Thanks for doing this.  Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.

Wait. Why does the fact that a device is enslaved have to affect udev in
any way? If it does, that smells like a bug in udev.


>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
>     bypass_master_register()
>     bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>      bypass_master_create()
>      bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h      |  96 ++++++
> net/Kconfig               |  18 +
> net/core/Makefile         |   1 +
> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> 	IFF_PHONY_HEADROOM		= 1<<24,
> 	IFF_MACSEC			= 1<<25,
> 	IFF_NO_RX_HANDLER		= 1<<26,
>+	IFF_BYPASS			= 1 << 27,
>+	IFF_BYPASS_SLAVE		= 1 << 28,

I wonder why you don't follow the existing coding style... Also, please
add these to the comment above.


> };
> 
> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> #define IFF_MACSEC			IFF_MACSEC
> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>+#define IFF_BYPASS			IFF_BYPASS
>+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> 
> /**
>  *	struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+	int (*slave_pre_register)(struct net_device *slave_netdev,
>+				  struct net_device *bypass_netdev);
>+	int (*slave_join)(struct net_device *slave_netdev,
>+			  struct net_device *bypass_netdev);
>+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+				    struct net_device *bypass_netdev);
>+	int (*slave_release)(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev);
>+	int (*slave_link_change)(struct net_device *slave_netdev,
>+				 struct net_device *bypass_netdev);
>+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+	struct list_head list;
>+	struct net_device __rcu *bypass_netdev;
>+	struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+	/* passthru netdev with same MAC */
>+	struct net_device __rcu *active_netdev;

You still use the "active"/"backup" names, which is highly misleading as
they have a completely different meaning than in bonding, for example.
I noted that in my previous review already. Please change it.


>+
>+	/* virtio_net netdev */
>+	struct net_device __rcu *backup_netdev;
>+
>+	/* active netdev stats */
>+	struct rtnl_link_stats64 active_stats;
>+
>+	/* backup netdev stats */
>+	struct rtnl_link_stats64 backup_stats;
>+
>+	/* aggregated stats */
>+	struct rtnl_link_stats64 bypass_stats;
>+
>+	/* spinlock while updating stats */
>+	spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master)
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master)
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
> 	  devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+	tristate "Bypass interface"
>+	---help---
>+	  This provides a generic interface for paravirtual drivers to listen
>+	  for netdev register/unregister/link change events from pci ethernet
>+	  devices with the same MAC and takeover their datapath. This also
>+	  enables live migration of a VM with direct attached VF by failing
>+	  over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+	tristate
>+	default m if NET_BYPASS=m
>+	default y if NET_BYPASS=y || NET_BYPASS=n
>+	help
>+	  Drivers using the bypass infrastructure should have a dependency
>+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>+	  bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+				     struct net_device *bypass_netdev,
>+				     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_register)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_register(slave_netdev,
>+						      bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+	if (backup ? rtnl_dereference(bi->backup_netdev) :
>+			rtnl_dereference(bi->active_netdev)) {
>+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+			   slave_netdev->name, backup ? "backup" : "active");
>+		return -EEXIST;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!slave_netdev->dev.parent ||
>+			!dev_is_pci(slave_netdev->dev.parent)))
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev,
>+			     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_join)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+	dev_hold(slave_netdev);
>+
>+	if (backup) {
>+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+		dev_get_stats(bi->active_netdev, &bi->active_stats);
>+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+	}
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+	struct sk_buff *skb = *pskb;
>+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+	skb->dev = ndev;
>+
>+	return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+						  struct bypass_ops **ops)
>+{
>+	struct bypass_master *bypass_master;
>+	struct net_device *bypass_netdev;
>+
>+	spin_lock(&bypass_lock);
>+	list_for_each_entry(bypass_master, &bypass_master_list, list) {

As I wrote the last time, you don't need this list or the spinlock.
You can just do something like:
        for_each_net(net) {
                for_each_netdev(net, dev) {
			if (netif_is_bypass_master(dev)) {




>+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+			*ops = rcu_dereference(bypass_master->ops);

I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
   existence.


>+			spin_unlock(&bypass_lock);
>+			return bypass_netdev;
>+		}
>+	}
>+	spin_unlock(&bypass_lock);
>+	return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret, orig_mtu;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);

For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"


>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+					bypass_ops);
>+	if (ret != 0)

	Just "if (ret)" will do. You have this in more places.


>+		goto done;
>+
>+	ret = netdev_rx_handler_register(slave_netdev,
>+					 bypass_ops ? bypass_ops->handle_frame :
>+					 bypass_handle_frame, bypass_netdev);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+			   ret);
>+		goto done;
>+	}
>+
>+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+			   bypass_netdev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+	if (netif_running(bypass_netdev)) {
>+		ret = dev_open(slave_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+				   slave_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}
>+
>+	/* Align MTU of slave with master */
>+	orig_mtu = slave_netdev->mtu;
>+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+	if (ret != 0) {
>+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+			   slave_netdev->name, bypass_netdev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+	if (ret != 0)
>+		goto err_join;
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+		    slave_netdev->name);
>+
>+	goto done;
>+
>+err_join:
>+	dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+	dev_close(slave_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(slave_netdev);
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+				       struct net_device *bypass_netdev,
>+				       struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_unregister)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_unregister(slave_netdev,
>+							bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+				struct net_device *bypass_netdev,
>+				struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_release)
>+			return -EINVAL;

I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.
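
One possible shape for such an API, sketched as a user-space model. All names here (struct model_role_ops, struct model_bypass_ops, model_slave_release(), the demo driver) are hypothetical and not part of the patch; the role names simply follow the patch's current "active"/"backup" wording.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct model_netdev {
	const char *name;
};

/* Callbacks for exactly one slave role. */
struct model_role_ops {
	int (*join)(struct model_netdev *slave, struct model_netdev *master);
	int (*release)(struct model_netdev *slave, struct model_netdev *master);
};

/* Two fixed roles and nothing else: there is no generic "slave" table
 * that a third kind of device could be plugged into.
 */
struct model_bypass_ops {
	struct model_role_ops active;	/* passthru device with the same MAC */
	struct model_role_ops backup;	/* paravirtual device */
};

enum model_role { MODEL_ROLE_ACTIVE, MODEL_ROLE_BACKUP };

/* Dispatch a release to the callback table of the given role. */
static int model_slave_release(const struct model_bypass_ops *ops,
			       enum model_role role,
			       struct model_netdev *slave,
			       struct model_netdev *master)
{
	const struct model_role_ops *r =
		role == MODEL_ROLE_ACTIVE ? &ops->active : &ops->backup;

	if (!r->release)
		return -EINVAL;

	return r->release(slave, master);
}

/* Example driver: it only handles release of the active device. */
static int demo_active_release(struct model_netdev *slave,
			       struct model_netdev *master)
{
	(void)slave;
	(void)master;
	return 0;
}

static const struct model_bypass_ops demo_ops = {
	.active = { .release = demo_active_release },
};
```

With the roles spelled out in the type, a driver cannot register callbacks for a slave that is neither of the two expected devices.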



>+
>+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev == backup_netdev) {
>+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(bi->active_netdev, NULL);
>+		if (backup_netdev) {
>+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+		}
>+	}
>+
>+	dev_put(slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+					  bypass_ops);
>+	if (ret != 0)
>+		goto done;
>+
>+	netdev_rx_handler_unregister(slave_netdev);
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+		    slave_netdev->name);
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+	return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+	struct bypass_ops *bypass_ops;
>+	struct bypass_info *bi;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_link_change)
>+			goto done;
>+
>+		return bypass_ops->slave_link_change(slave_netdev,
>+						     bypass_netdev);
>+	}
>+
>+	if (!netif_running(bypass_netdev))
>+		return 0;
>+
>+	bi = netdev_priv(bypass_netdev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		goto done;

You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.


>+
>+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+		netif_carrier_on(bypass_netdev);
>+		netif_tx_wake_all_queues(bypass_netdev);
>+	} else {
>+		netif_carrier_off(bypass_netdev);
>+		netif_tx_stop_all_queues(bypass_netdev);
>+	}
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+	/* Skip parent events */
>+	if (netif_is_bypass_master(dev))
>+		return false;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (dev->type != ARPHRD_ETHER)
>+		return false;
>+
>+	/* Avoid Vlan dev with same MAC registering as VF */
>+	if (is_vlan_dev(dev))
>+		return false;
>+
>+	/* Avoid Bonding master dev with same MAC registering as slave dev */
>+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))

Yeah, this is certainly incorrect. For one thing, you should be using the
netif_is_bond_master() helper.
But what about the rest? macsec, macvlan, team, bridge, ovs and others?

You need to do this by whitelisting, not by blacklisting: accept only
VF devices. My port flavours patchset might help with this.
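
The difference can be shown with a small user-space sketch. This is not
kernel code: enum mock_kind, struct evt_dev and both functions are
hypothetical, and how a real VF would be identified (parent bus, port
flavour, ...) is left open here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device description; none of this is kernel API. */
enum mock_kind {
	KIND_PCI_VF,
	KIND_VLAN,
	KIND_BOND,
	KIND_MACVLAN,
	KIND_BRIDGE,
};

struct evt_dev {
	enum mock_kind kind;
	bool is_ethernet;
};

/* Blacklist: every stacked device type has to be listed explicitly. */
bool slave_ok_blacklist(const struct evt_dev *dev)
{
	assert(dev);
	if (!dev->is_ethernet)
		return false;
	if (dev->kind == KIND_VLAN || dev->kind == KIND_BOND)
		return false;
	return true;	/* macvlan, bridge, ... are not listed and slip through */
}

/* Whitelist: only the one kind we actually want is accepted. */
bool slave_ok_whitelist(const struct evt_dev *dev)
{
	assert(dev);
	return dev->is_ethernet && dev->kind == KIND_PCI_VF;
}
```

With the blacklist, a macvlan with the same MAC is accepted simply because
nobody thought to list it; with the whitelist it is rejected without
being named.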


>+		return false;
>+
>+	return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	if (!bypass_validate_event_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return bypass_slave_register(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return bypass_slave_unregister(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return bypass_slave_link_change(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+	.notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int err;
>+
>+	netif_carrier_off(dev);
>+	netif_tx_wake_all_queues(dev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		err = dev_open(active_netdev);
>+		if (err)
>+			goto err_active_open;
>+	}
>+
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		err = dev_open(backup_netdev);
>+		if (err)
>+			goto err_backup_open;
>+	}
>+
>+	return 0;
>+
>+err_backup_open:
>+	dev_close(active_netdev);
>+err_active_open:
>+	netif_tx_disable(dev);
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+	struct bypass_info *vi = netdev_priv(dev);

This should probably be "bi".


>+	struct net_device *slave_netdev;
>+
>+	netif_tx_disable(dev);
>+
>+	slave_netdev = rtnl_dereference(vi->active_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(vi->backup_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	atomic_long_inc(&dev->tx_dropped);
>+	dev_kfree_skb_any(skb);
>+	return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.


>+	struct net_device *xmit_dev;

Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.



>+
>+	/* Try xmit via active netdev followed by backup netdev */
>+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+			return bypass_drop_xmit(skb, dev);
>+	}
>+
>+	skb->dev = xmit_dev;
>+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+	return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+			void *accel_priv, select_queue_fallback_t fallback)
>+{
>+	/* This helper function exists to help dev_pick_tx get the correct
>+	 * destination queue.  Using a helper function skips a call to
>+	 * skb_tx_hash and will put the skbs in the queue we expect on their
>+	 * way down to the bonding driver.
>+	 */
>+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+	/* Save the original txq to restore before passing to the driver */
>+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+	if (unlikely(txq >= dev->real_num_tx_queues)) {
>+		do {
>+			txq -= dev->real_num_tx_queues;
>+		} while (txq >= dev->real_num_tx_queues);
>+	}
>+
>+	return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+			      const struct rtnl_link_stats64 *_new,
>+			      const struct rtnl_link_stats64 *_old)
>+{
>+	const u64 *new = (const u64 *)_new;
>+	const u64 *old = (const u64 *)_old;
>+	u64 *res = (u64 *)_res;
>+	int i;
>+
>+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+		u64 nv = new[i];
>+		u64 ov = old[i];
>+		s64 delta = nv - ov;
>+
>+		/* detects if this particular field is 32bit only */
>+		if (((nv | ov) >> 32) == 0)
>+			delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+		/* filter anomalies, some drivers reset their stats
>+		 * at down/up events.
>+		 */
>+		if (delta > 0)
>+			res[i] += delta;
>+	}
>+}
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

You can WARN_ON and return if the dev is not a bypass master, just
to catch buggy drivers. Same for the other helpers.
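
Roughly what such a guard does, as a user-space sketch. The warn_on()
function below is only a stand-in for the kernel's WARN_ON(), and
struct mock_netdev, struct mock_stats and mock_get_stats() are made-up
names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct mock_stats {
	unsigned long long rx_packets;
	unsigned long long tx_packets;
};

struct mock_netdev {
	bool is_bypass_master;	/* stands in for the IFF_BYPASS flag */
	struct mock_stats totals;
};

/* User-space stand-in for WARN_ON(): log and hand back the condition. */
bool warn_on(bool cond, const char *expr)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", expr);
	return cond;
}

/* Copies the totals only for a bypass master; for any other device it
 * warns and leaves *out untouched.
 */
void mock_get_stats(const struct mock_netdev *dev, struct mock_stats *out)
{
	assert(dev && out);
	if (warn_on(!dev->is_bypass_master, "!netif_is_bypass_master(dev)"))
		return;
	memcpy(out, &dev->totals, sizeof(*out));
}
```

A driver that passes the wrong netdev gets a loud warning instead of the
helper interpreting unrelated private data as struct bypass_info.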


>+	const struct rtnl_link_stats64 *new;
>+	struct rtnl_link_stats64 temp;
>+	struct net_device *slave_netdev;
>+
>+	spin_lock(&bi->stats_lock);
>+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->active_stats);
>+		memcpy(&bi->active_stats, new, sizeof(*new));
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->backup_stats);
>+		memcpy(&bi->backup_stats, new, sizeof(*new));
>+	}
>+
>+	rcu_read_unlock();
>+
>+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+	spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int ret = 0;

Pointless initialization.


>+
>+	active_netdev = rcu_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		ret = dev_set_mtu(active_netdev, new_mtu);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	backup_netdev = rcu_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		ret = dev_set_mtu(backup_netdev, new_mtu);
>+		if (ret) {
>+			dev_set_mtu(active_netdev, dev->mtu);
>+			return ret;
>+		}
>+	}
>+
>+	dev->mtu = new_mtu;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+	.ndo_open		= bypass_open,
>+	.ndo_stop		= bypass_close,
>+	.ndo_start_xmit		= bypass_start_xmit,
>+	.ndo_select_queue	= bypass_select_queue,
>+	.ndo_get_stats64	= bypass_get_stats,
>+	.ndo_change_mtu		= bypass_change_mtu,
>+	.ndo_set_rx_mode	= bypass_set_rx_mode,
>+	.ndo_validate_addr	= eth_validate_addr,
>+	.ndo_features_check	= passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+				       struct ethtool_drvinfo *drvinfo)
>+{
>+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+				      struct ethtool_link_ksettings *cmd)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+		slave_netdev = rtnl_dereference(bi->backup_netdev);
>+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+			cmd->base.duplex = DUPLEX_UNKNOWN;
>+			cmd->base.port = PORT_OTHER;
>+			cmd->base.speed = SPEED_UNKNOWN;
>+
>+			return 0;
>+		}
>+	}
>+
>+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>+	.get_link               = ethtool_op_get_link,
>+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+	struct net *net = dev_net(bypass_netdev);
>+	struct net_device *dev;
>+
>+	rtnl_lock();
>+	for_each_netdev(net, dev) {
>+		if (dev == bypass_netdev)
>+			continue;
>+		if (!bypass_validate_event_dev(dev))
>+			continue;
>+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+			bypass_slave_register(dev);
>+	}
>+	rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master)
>+{
>+	struct bypass_master *bypass_master;
>+
>+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+	if (!bypass_master)
>+		return -ENOMEM;
>+
>+	rcu_assign_pointer(bypass_master->ops, ops);
>+	dev_hold(dev);
>+	dev->priv_flags |= IFF_BYPASS;
>+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+	spin_lock(&bypass_lock);
>+	list_add_tail(&bypass_master->list, &bypass_master_list);
>+	spin_unlock(&bypass_lock);
>+
>+	bypass_register_existing_slave(dev);
>+
>+	*pbypass_master = bypass_master;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+	dev_put(bypass_netdev);
>+
>+	spin_lock(&bypass_lock);
>+	list_del(&bypass_master->list);
>+	spin_unlock(&bypass_lock);
>+
>+	kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master)
>+{
>+	struct device *dev = backup_netdev->dev.parent;
>+	struct net_device *bypass_netdev;
>+	int err;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+				       IFF_TX_SKB_SHARING);
>+
>+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+	bypass_netdev->features |= NETIF_F_LLTX;
>+
>+	/* Don't allow bypass devices to change network namespaces. */
>+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+	bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+	       bypass_netdev->addr_len);
>+
>+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+	err = register_netdev(bypass_netdev);
>+	if (err < 0) {
>+		dev_err(dev, "Unable to register bypass_netdev!\n");
>+		goto err_register_netdev;
>+	}
>+
>+	netif_carrier_off(bypass_netdev);
>+
>+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+	if (err < 0)

Just "if (err)" would do.


>+		goto err_bypass;
>+
>+	return 0;
>+
>+err_bypass:
>+	unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+	free_netdev(bypass_netdev);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+	struct net_device *slave_netdev;
>+	struct bypass_info *bi;
>+
>+	if (!bypass_master)
>+		return;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+	bi = netdev_priv(bypass_netdev);
>+
>+	netif_device_detach(bypass_netdev);
>+
>+	rtnl_lock();
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	bypass_master_unregister(bypass_master);
>+
>+	unregister_netdevice(bypass_netdev);
>+
>+	rtnl_unlock();
>+
>+	free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+	register_netdevice_notifier(&bypass_notifier);
>+
>+	return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+	unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>-- 
>2.14.3
>

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]   ` <20180411155127.GQ2028@nanopsycho>
@ 2018-04-11 19:13     ` Samudrala, Sridhar
       [not found]     ` <6a8c1ff5-153a-e40a-91b3-48532b8d3a38@intel.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Samudrala, Sridhar @ 2018-04-11 19:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each bypass
>> instance along with a set of ops to manage the slave events.
>>      bypass_master_register()
>>      bypass_master_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the bypass module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>       bypass_master_create()
>>       bypass_master_destroy()
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/linux/netdevice.h |  14 +
>> include/net/bypass.h      |  96 ++++++
>> net/Kconfig               |  18 +
>> net/core/Makefile         |   1 +
>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 973 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..587293728f70 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> 	IFF_PHONY_HEADROOM		= 1<<24,
>> 	IFF_MACSEC			= 1<<25,
>> 	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_BYPASS			= 1 << 27,
>> +	IFF_BYPASS_SLAVE		= 1 << 28,
> I wonder, why you don't follow the existing coding style... Also, please
> add these to into the comment above.

That was to avoid checkpatch warnings. If it is OK to ignore them, I can
switch back to the existing coding style for consistency.

>
>
>> };
>>
>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC			IFF_MACSEC
>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_BYPASS			IFF_BYPASS
>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>
>> /**
>>   *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS;
>> +}
>> +
>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..86b02cb894cf
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,96 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> +				  struct net_device *bypass_netdev);
>> +	int (*slave_join)(struct net_device *slave_netdev,
>> +			  struct net_device *bypass_netdev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> +				    struct net_device *bypass_netdev);
>> +	int (*slave_release)(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev);
>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>> +				 struct net_device *bypass_netdev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_master {
>> +	struct list_head list;
>> +	struct net_device __rcu *bypass_netdev;
>> +	struct bypass_ops __rcu *ops;
>> +};
>> +
>> +/* bypass state */
>> +struct bypass_info {
>> +	/* passthru netdev with same MAC */
>> +	struct net_device __rcu *active_netdev;
> You still use "active"/"backup" names which is highly misleading as
> it has completely different meaning that in bond for example.
> I noted that in my previous review already. Please change it.

I guess the issue is only with the 'active' name. 'backup' should be fine, as it
also matches the BACKUP feature bit we are adding to virtio_net.

As for alternative names for 'active': you suggested 'stolen', but I am not
too happy with it.
netvsc uses vf_netdev; are you OK with that? Another option is 'passthru'.



>
>
>> +
>> +	/* virtio_net netdev */
>> +	struct net_device __rcu *backup_netdev;
>> +
>> +	/* active netdev stats */
>> +	struct rtnl_link_stats64 active_stats;
>> +
>> +	/* backup netdev stats */
>> +	struct rtnl_link_stats64 backup_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 bypass_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +void bypass_master_destroy(struct bypass_master *bypass_master);
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master);
>> +void bypass_master_unregister(struct bypass_master *bypass_master);
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct pbypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> 	  devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> +	tristate "Bypass interface"
>> +	---help---
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> +	tristate
>> +	default m if NET_BYPASS=m
>> +	default y if NET_BYPASS=y || NET_BYPASS=n
>> +	help
>> +	  Drivers using the bypass infrastructure should have a dependency
>> +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> +	  bypass is a loadable module and the driver using it is built-in.
>> +
>> endif   # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..b5b9cb554c3f
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_master_list);
>> +static DEFINE_SPINLOCK(bypass_lock);
>> +
>> +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> +				     struct net_device *bypass_netdev,
>> +				     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_register(slave_netdev,
>> +						      bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> +			rtnl_dereference(bi->active_netdev)) {
>> +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_netdev->name, backup ? "backup" : "active");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!slave_netdev->dev.parent ||
>> +			!dev_is_pci(slave_netdev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_join(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev,
>> +			     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +
>> +	dev_hold(slave_netdev);
>> +
>> +	if (backup) {
>> +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> +	}
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> As I wrote the last time, you don't need this list, spinlock.
> You can do just something like:
>          for_each_net(net) {
>                  for_each_netdev(net, dev) {
> 			if (netif_is_bypass_master(dev)) {

This function returns the upper netdev as well as the ops associated
with that netdev.
bypass_master_list is a list of 'struct bypass_master' that associates
'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
We need 'ops' only to support the 2 netdev model of netvsc. ops will be
NULL for 3-netdev model.
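
As a rough user-space sketch of that association (not the kernel code:
struct mock_master, struct mock_ops and the array standing in for
bypass_master_list are made up, and locking is omitted entirely), the
lookup hands back both the upper device and its ops, with ops == NULL
meaning the 3-netdev model:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

struct mock_ops {
	int unused;	/* stands in for the callbacks in struct bypass_ops */
};

struct mock_master {
	unsigned char perm_addr[MAC_LEN];
	const char *name;
	const struct mock_ops *ops;	/* NULL in the 3-netdev model */
};

/* Return the master whose permanent MAC matches and report its ops
 * through *ops. On a miss, return NULL and set *ops to NULL.
 */
const struct mock_master *master_get_bymac(const struct mock_master *tbl,
					   size_t n,
					   const unsigned char *mac,
					   const struct mock_ops **ops)
{
	size_t i;

	assert(tbl && mac && ops);
	for (i = 0; i < n; i++) {
		if (memcmp(tbl[i].perm_addr, mac, MAC_LEN) == 0) {
			*ops = tbl[i].ops;
			return &tbl[i];
		}
	}
	*ops = NULL;
	return NULL;
}
```

If the ops pointer were stored somewhere reachable from the netdev
itself, a plain netdev walk would be enough; the sketch only shows why
the current code wants a per-master record.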


>
>
>
>
>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> +			*ops = rcu_dereference(bypass_master->ops);
> I don't see how rcu_dereference is ok here.
> 1) I don't see rcu_read_lock taken
> 2) Looks like bypass_master->ops has the same value across the whole
>     existence.

We hold rtnl_lock(), so I think I need to change this to rtnl_dereference().
Yes, ops doesn't change.

>
>
>> +			spin_unlock(&bypass_lock);
>> +			return bypass_netdev;
>> +		}
>> +	}
>> +	spin_unlock(&bypass_lock);
>> +	return NULL;
>> +}
>> +
>> +static int bypass_slave_register(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret, orig_mtu;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
> For master, could you use word "master" in the variables so it is clear?
> Also, "dev" is fine instead of "netdev".
> Something like "bpmaster_dev"

bypass_master is of type struct bypass_master; bypass_netdev is of type
struct net_device.
I can change all _netdev suffixes to _dev to make the names shorter.


>
>
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> +					bypass_ops);
>> +	if (ret != 0)
> 	Just "if (ret)" will do. You have this on more places.

OK.


>
>
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_netdev,
>> +					 bypass_ops ? bypass_ops->handle_frame :
>> +					 bypass_handle_frame, bypass_netdev);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> +			   bypass_netdev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> +
>> +	if (netif_running(bypass_netdev)) {
>> +		ret = dev_open(slave_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> +				   slave_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with master */
>> +	orig_mtu = slave_netdev->mtu;
>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> +	if (ret != 0) {
>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_netdev->name, bypass_netdev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> +	if (ret != 0)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> +		    slave_netdev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	dev_set_mtu(slave_netdev, orig_mtu);
>> +err_set_mtu:
>> +	dev_close(slave_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> +				       struct net_device *bypass_netdev,
>> +				       struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> +							bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_release(struct net_device *slave_netdev,
>> +				struct net_device *bypass_netdev,
>> +				struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_release)
>> +			return -EINVAL;
> I think it would be good to make the API to the driver more strict and
> have a separate set of ops for "active" and "backup" netdevices.
> That should stop people thinking about extending this to more slaves in
> the future.

We have checks in slave_pre_register() that allow only 1 'backup' and 1
'active' slave.
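
A rough user-space sketch of that slot check, mirroring the logic of
bypass_slave_pre_register() from this patch. The model_* types and names
are stand-ins made up for illustration; they are not kernel structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct net_device. */
struct model_dev {
	const void *parent;		/* models dev.parent */
};

/* Hypothetical stand-in for struct bypass_info. */
struct model_info {
	struct model_dev *active;	/* models bi->active_netdev */
	struct model_dev *backup;	/* models bi->backup_netdev */
};

/* Returns 0 if the slot this slave would occupy is free, -EEXIST if not. */
static int model_pre_register(const struct model_info *bi,
			      const struct model_dev *master,
			      const struct model_dev *slave)
{
	/* A slave that shares the master's parent device is the backup. */
	bool backup = (slave->parent == master->parent);

	if (backup ? bi->backup != NULL : bi->active != NULL)
		return -EEXIST;

	return 0;
}
```

With one slot per role, a second device of either role is rejected
before it can be linked, which is why no third slave can attach.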


>
>
>
>> +
>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev == backup_netdev) {
>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> +		if (backup_netdev) {
>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> +					  bypass_ops);
>> +	if (ret != 0)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +
>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> +		    slave_netdev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> +
>> +static bool bypass_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	struct bypass_info *bi;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return bypass_ops->slave_link_change(slave_netdev,
>> +						     bypass_netdev);
>> +	}
>> +
>> +	if (!netif_running(bypass_netdev))
>> +		return 0;
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		goto done;
> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> above is enough.

I think we need this check to not allow events from a slave that is not
attached to this master but has the same MAC.
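
To illustrate the case I have in mind, here is a minimal user-space
sketch. The model_* types are invented for this example and only mimic
the MAC lookup followed by the active/backup comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ETH_ALEN 6

/* Hypothetical user-space stand-in for struct net_device. */
struct model_dev {
	unsigned char perm_addr[MODEL_ETH_ALEN];
};

/* Hypothetical stand-in for the master plus its struct bypass_info. */
struct model_master {
	struct model_dev dev;
	const struct model_dev *active;
	const struct model_dev *backup;
};

/* True only if the event device is one of the two attached slaves. */
static bool model_link_event_allowed(const struct model_master *m,
				     const struct model_dev *slave)
{
	/* The MAC lookup alone is enough to find the master ... */
	if (memcmp(m->dev.perm_addr, slave->perm_addr, MODEL_ETH_ALEN) != 0)
		return false;

	/* ... but it does not prove that this device was ever joined. */
	return slave == m->active || slave == m->backup;
}
```

A second device carrying the same MAC passes the lookup but fails the
pointer comparison, so its link events do not touch the master's carrier.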

>
>
>> +
>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> +		netif_carrier_on(bypass_netdev);
>> +		netif_tx_wake_all_queues(bypass_netdev);
>> +	} else {
>> +		netif_carrier_off(bypass_netdev);
>> +		netif_tx_stop_all_queues(bypass_netdev);
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_bypass_master(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as VF */
>> +	if (is_vlan_dev(dev))
>> +		return false;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> Yeah, this is certainly incorrect. One thing is, you should be using the
> helpers netif_is_bond_master().
> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>
> You need to do it not by blacklisting, but with whitelisting. You need
> to whitelist VF devices. My port flavours patchset might help with this.

Maybe I can use the netdev_has_lower_dev() helper to make sure that the slave
device is not an upper dev.
Can you point me to your port flavours patchset? Is it upstream?
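
To make the blacklist-versus-whitelist point concrete, here is a rough
user-space sketch. The device kinds and both helper names are made up
for illustration; they are not kernel APIs and say nothing about how the
port flavours patchset classifies devices.

```c
#include <stdbool.h>

/* Hypothetical device kinds; the kernel has no such enum. */
enum model_kind {
	MODEL_KIND_VF,
	MODEL_KIND_VLAN,
	MODEL_KIND_BOND_MASTER,
	MODEL_KIND_MACVLAN,
	MODEL_KIND_BRIDGE,
};

/* Blacklist: reject only the kinds someone remembered to list. */
static bool model_blacklist_ok(enum model_kind kind)
{
	return kind != MODEL_KIND_VLAN && kind != MODEL_KIND_BOND_MASTER;
}

/* Whitelist: accept only the one kind that may become a slave. */
static bool model_whitelist_ok(enum model_kind kind)
{
	return kind == MODEL_KIND_VF;
}
```

In this model a macvlan or bridge slips through the blacklist, while the
whitelist rejects every kind that was not explicitly allowed.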

>
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!bypass_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return bypass_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return bypass_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return bypass_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> +	.notifier_call = bypass_event,
>> +};
>> +
>> +int bypass_open(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		err = dev_open(active_netdev);
>> +		if (err)
>> +			goto err_active_open;
>> +	}
>> +
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		err = dev_open(backup_netdev);
>> +		if (err)
>> +			goto err_backup_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_backup_open:
>> +	dev_close(active_netdev);
>> +err_active_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_open);
>> +
>> +int bypass_close(struct net_device *dev)
>> +{
>> +	struct bypass_info *vi = netdev_priv(dev);
> This should be probably "bi"

Yes.


>
>
>> +	struct net_device *slave_netdev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_close);
>> +
>> +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> If you rename the other variable to "bpmaster_dev", it would be nice to
> rename this to bpinfo or something more descriptive. "bi" is too short
> to know what that is right away.

I will rename bypass_netdev to bypass_dev. The "bypass" prefix indicates that
it is an upper master dev.


>
>
>> +	struct net_device *xmit_dev;
> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.

OK.


>
>
>
>> +
>> +	/* Try xmit via active netdev followed by backup netdev */
>> +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> +			return bypass_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> +
>> +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +			void *accel_priv, select_queue_fallback_t fallback)
>> +{
>> +	/* This helper function exists to help dev_pick_tx get the correct
>> +	 * destination queue.  Using a helper function skips a call to
>> +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> +	 * way down to the bonding driver.
>> +	 */
>> +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +			      const struct rtnl_link_stats64 *_new,
>> +			      const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> You can WARN_ON and return in case the dev is not bypass master, just
> to catch buggy drivers. Same with other helpers.

I can make this helper, as well as all the other bypass_netdev ops, static
and stop exporting them.
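
For anyone following the stats path: bypass_get_stats() accumulates
per-slave deltas through bypass_fold_stats(), quoted earlier in this
patch. Below is a small user-space sketch of that folding, assuming the
counters are passed as plain arrays; model_fold_stats() is a name made
up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the positive change between two samples of each counter to res[]. */
static void model_fold_stats(uint64_t *res, const uint64_t *new_vals,
			     const uint64_t *old_vals, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t nv = new_vals[i];
		uint64_t ov = old_vals[i];
		int64_t delta = (int64_t)(nv - ov);

		/* Both samples fit in 32 bits: treat the counter as 32-bit
		 * so that a wrap-around still yields a small positive delta.
		 */
		if (((nv | ov) >> 32) == 0)
			delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

		/* A negative delta means the driver reset its counters. */
		if (delta > 0)
			res[i] += (uint64_t)delta;
	}
}
```

A 32-bit counter that wraps from 0xFFFFFFFB to 5 contributes 10, while a
counter that drops from 1000 to 3 after a reset contributes nothing.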

>
>
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_netdev;
>> +
>> +	spin_lock(&bi->stats_lock);
>> +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->active_stats);
>> +		memcpy(&bi->active_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> +	spin_unlock(&bi->stats_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> +
>> +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int ret = 0;
> Pointless initialization.
>
>
>> +
>> +	active_netdev = rcu_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		ret = dev_set_mtu(active_netdev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(active_netdev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> +
>> +void bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> +
>> +static const struct net_device_ops bypass_netdev_ops = {
>> +	.ndo_open		= bypass_open,
>> +	.ndo_stop		= bypass_close,
>> +	.ndo_start_xmit		= bypass_start_xmit,
>> +	.ndo_select_queue	= bypass_select_queue,
>> +	.ndo_get_stats64	= bypass_get_stats,
>> +	.ndo_change_mtu		= bypass_change_mtu,
>> +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
>> +};
>> +
>> +#define BYPASS_DRV_NAME "bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> +				       struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> +				      struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops bypass_ethtool_ops = {
>> +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> +{
>> +	struct net *net = dev_net(bypass_netdev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == bypass_netdev)
>> +			continue;
>> +		if (!bypass_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> +			bypass_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master)
>> +{
>> +	struct bypass_master *bypass_master;
>> +
>> +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> +	if (!bypass_master)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(bypass_master->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_BYPASS;
>> +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	bypass_register_existing_slave(dev);
>> +
>> +	*pbypass_master = bypass_master;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_register);
>> +
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +
>> +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> +	dev_put(bypass_netdev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_del(&bypass_master->list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	kfree(bypass_master);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master)
>> +{
>> +	struct device *dev = backup_netdev->dev.parent;
>> +	struct net_device *bypass_netdev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> +	bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow bypass devices to change network namespaces. */
>> +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> +	       bypass_netdev->addr_len);
>> +
>> +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> +	err = register_netdev(bypass_netdev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(bypass_netdev);
>> +
>> +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> +	if (err < 0)
> just "if (err)" would do.

OK

>
>
>> +		goto err_bypass;
>> +
>> +	return 0;
>> +
>> +err_bypass:
>> +	unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> +	free_netdev(bypass_netdev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_create);
>> +
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct net_device *slave_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (!bypass_master)
>> +		return;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	netif_device_detach(bypass_netdev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	bypass_master_unregister(bypass_master);
>> +
>> +	unregister_netdevice(bypass_netdev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(bypass_netdev);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> +	register_netdevice_notifier(&bypass_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3
>>


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]     ` <6a8c1ff5-153a-e40a-91b3-48532b8d3a38@intel.com>
@ 2018-04-18  9:25       ` Jiri Pirko
       [not found]       ` <20180418092515.GB1989@nanopsycho>
  1 sibling, 0 replies; 47+ messages in thread
From: Jiri Pirko @ 2018-04-18  9:25 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > This provides a generic interface for paravirtual drivers to listen
>> > for netdev register/unregister/link change events from pci ethernet
>> > devices with the same MAC and takeover their datapath. The notifier and
>> > event handling code is based on the existing netvsc implementation.
>> > 
>> > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > master netdev is created. The paravirtual driver registers each bypass
>> > instance along with a set of ops to manage the slave events.
>> >      bypass_master_register()
>> >      bypass_master_unregister()
>> > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > the bypass module provides interfaces to create/destroy additional master
>> > netdev and all the slave events are managed internally.
>> >       bypass_master_create()
>> >       bypass_master_destroy()
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> > include/linux/netdevice.h |  14 +
>> > include/net/bypass.h      |  96 ++++++
>> > net/Kconfig               |  18 +
>> > net/core/Makefile         |   1 +
>> > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > 5 files changed, 973 insertions(+)
>> > create mode 100644 include/net/bypass.h
>> > create mode 100644 net/core/bypass.c
>> > 
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index cf44503ea81a..587293728f70 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > 	IFF_MACSEC			= 1<<25,
>> > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > +	IFF_BYPASS			= 1 << 27,
>> > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> I wonder why you don't follow the existing coding style... Also, please
>> add these to the comment above.
>
>To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>to the existing coding style to be consistent.

Please do.


>
>> 
>> 
>> > };
>> > 
>> > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > #define IFF_MACSEC			IFF_MACSEC
>> > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > +#define IFF_BYPASS			IFF_BYPASS
>> > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > 
>> > /**
>> >   *	struct net_device - The DEVICE structure.
>> > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > }
>> > 
>> > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > +{
>> > +	return dev->priv_flags & IFF_BYPASS;
>> > +}
>> > +
>> > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > +{
>> > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > +}
>> > +
>> > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > static inline void netif_keep_dst(struct net_device *dev)
>> > {
>> > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > new file mode 100644
>> > index 000000000000..86b02cb894cf
>> > --- /dev/null
>> > +++ b/include/net/bypass.h
>> > @@ -0,0 +1,96 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/* Copyright (c) 2018, Intel Corporation. */
>> > +
>> > +#ifndef _NET_BYPASS_H
>> > +#define _NET_BYPASS_H
>> > +
>> > +#include <linux/netdevice.h>
>> > +
>> > +struct bypass_ops {
>> > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > +				  struct net_device *bypass_netdev);
>> > +	int (*slave_join)(struct net_device *slave_netdev,
>> > +			  struct net_device *bypass_netdev);
>> > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > +				    struct net_device *bypass_netdev);
>> > +	int (*slave_release)(struct net_device *slave_netdev,
>> > +			     struct net_device *bypass_netdev);
>> > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > +				 struct net_device *bypass_netdev);
>> > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > +};
>> > +
>> > +struct bypass_master {
>> > +	struct list_head list;
>> > +	struct net_device __rcu *bypass_netdev;
>> > +	struct bypass_ops __rcu *ops;
>> > +};
>> > +
>> > +/* bypass state */
>> > +struct bypass_info {
>> > +	/* passthru netdev with same MAC */
>> > +	struct net_device __rcu *active_netdev;
>> You still use the "active"/"backup" names, which is highly misleading as
>> they have a completely different meaning than in bond, for example.
>> I noted that in my previous review already. Please change it.
>
>I guess the issue is only with the 'active' name. 'backup' should be fine as it also
>matches the BACKUP feature bit we are adding to virtio_net.

I think that "backup" is also misleading. Both "active" and "backup"
mean a *state* of slaves. This should be named differently.



>
>With regard to alternate names for 'active', you suggested 'stolen', but I
>am not too happy with it.
>netvsc uses vf_netdev; are you OK with this? Another option is 'passthru'.

No. The netdev could be any netdevice. It does not have to be a "VF".
I think "stolen" is quite appropriate since it describes the modus
operandi. The bypass master steals some netdevice according to some
match.

But I don't insist on "stolen". Just sounds right.



>
>
>
>> 
>> 
>> > +
>> > +	/* virtio_net netdev */
>> > +	struct net_device __rcu *backup_netdev;
>> > +
>> > +	/* active netdev stats */
>> > +	struct rtnl_link_stats64 active_stats;
>> > +
>> > +	/* backup netdev stats */
>> > +	struct rtnl_link_stats64 backup_stats;
>> > +
>> > +	/* aggregated stats */
>> > +	struct rtnl_link_stats64 bypass_stats;
>> > +
>> > +	/* spinlock while updating stats */
>> > +	spinlock_t stats_lock;
>> > +};
>> > +
>> > +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> > +
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master);
>> > +void bypass_master_destroy(struct bypass_master *bypass_master);
>> > +
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master);
>> > +void bypass_master_unregister(struct bypass_master *bypass_master);
>> > +
>> > +int bypass_slave_unregister(struct net_device *slave_netdev);
>> > +
>> > +#else
>> > +
>> > +static inline
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +static inline
>> > +void bypass_master_destroy(struct bypass_master *bypass_master)
>> > +{
>> > +}
>> > +
>> > +static inline
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +static inline
>> > +void bypass_master_unregister(struct bypass_master *bypass_master)
>> > +{
>> > +}
>> > +
>> > +static inline
>> > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +#endif
>> > +
>> > +#endif /* _NET_BYPASS_H */
>> > diff --git a/net/Kconfig b/net/Kconfig
>> > index 0428f12c25c2..994445f4a96a 100644
>> > --- a/net/Kconfig
>> > +++ b/net/Kconfig
>> > @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> > 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> > 	  devlink is a loadable module and the driver using it is built-in.
>> > 
>> > +config NET_BYPASS
>> > +	tristate "Bypass interface"
>> > +	---help---
>> > +	  This provides a generic interface for paravirtual drivers to listen
>> > +	  for netdev register/unregister/link change events from pci ethernet
>> > +	  devices with the same MAC and takeover their datapath. This also
>> > +	  enables live migration of a VM with direct attached VF by failing
>> > +	  over to the paravirtual datapath when the VF is unplugged.
>> > +
>> > +config MAY_USE_BYPASS
>> > +	tristate
>> > +	default m if NET_BYPASS=m
>> > +	default y if NET_BYPASS=y || NET_BYPASS=n
>> > +	help
>> > +	  Drivers using the bypass infrastructure should have a dependency
>> > +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> > +	  bypass is a loadable module and the driver using it is built-in.
>> > +
>> > endif   # if NET
>> > 
>> > # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> > diff --git a/net/core/Makefile b/net/core/Makefile
>> > index 6dbbba8c57ae..a9727ed1c8fc 100644
>> > --- a/net/core/Makefile
>> > +++ b/net/core/Makefile
>> > @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> > obj-$(CONFIG_HWBM) += hwbm.o
>> > obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> > obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> > +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> > diff --git a/net/core/bypass.c b/net/core/bypass.c
>> > new file mode 100644
>> > index 000000000000..b5b9cb554c3f
>> > --- /dev/null
>> > +++ b/net/core/bypass.c
>> > @@ -0,0 +1,844 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/* Copyright (c) 2018, Intel Corporation. */
>> > +
>> > +/* A common module to handle registrations and notifications for paravirtual
>> > + * drivers to enable accelerated datapath and support VF live migration.
>> > + *
>> > + * The notifier and event handling code is based on netvsc driver.
>> > + */
>> > +
>> > +#include <linux/netdevice.h>
>> > +#include <linux/etherdevice.h>
>> > +#include <linux/ethtool.h>
>> > +#include <linux/module.h>
>> > +#include <linux/slab.h>
>> > +#include <linux/netdevice.h>
>> > +#include <linux/netpoll.h>
>> > +#include <linux/rtnetlink.h>
>> > +#include <linux/if_vlan.h>
>> > +#include <linux/pci.h>
>> > +#include <net/sch_generic.h>
>> > +#include <uapi/linux/if_arp.h>
>> > +#include <net/bypass.h>
>> > +
>> > +static LIST_HEAD(bypass_master_list);
>> > +static DEFINE_SPINLOCK(bypass_lock);
>> > +
>> > +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> > +				     struct net_device *bypass_netdev,
>> > +				     struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct bypass_info *bi;
>> > +	bool backup;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_pre_register)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_pre_register(slave_netdev,
>> > +						      bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> > +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> > +			rtnl_dereference(bi->active_netdev)) {
>> > +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> > +			   slave_netdev->name, backup ? "backup" : "active");
>> > +		return -EEXIST;
>> > +	}
>> > +
>> > +	/* Avoid non pci devices as active netdev */
>> > +	if (!backup && (!slave_netdev->dev.parent ||
>> > +			!dev_is_pci(slave_netdev->dev.parent)))
>> > +		return -EINVAL;
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +static int bypass_slave_join(struct net_device *slave_netdev,
>> > +			     struct net_device *bypass_netdev,
>> > +			     struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct bypass_info *bi;
>> > +	bool backup;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_join)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> > +
>> > +	dev_hold(slave_netdev);
>> > +
>> > +	if (backup) {
>> > +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> > +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> > +	} else {
>> > +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> > +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> > +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> > +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> > +	}
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +/* Called when slave dev is injecting data into network stack.
>> > + * Change the associated network device from lower dev to virtio.
>> > + * note: already called with rcu_read_lock
>> > + */
>> > +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> > +{
>> > +	struct sk_buff *skb = *pskb;
>> > +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> > +
>> > +	skb->dev = ndev;
>> > +
>> > +	return RX_HANDLER_ANOTHER;
>> > +}
>> > +
>> > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > +						  struct bypass_ops **ops)
>> > +{
>> > +	struct bypass_master *bypass_master;
>> > +	struct net_device *bypass_netdev;
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> As I wrote the last time, you don't need this list or the spinlock.
>> You can do just something like:
>>          for_each_net(net) {
>>                  for_each_netdev(net, dev) {
>> 			if (netif_is_bypass_master(dev)) {
>
>This function returns the upper netdev as well as the ops associated
>with that netdev.
>bypass_master_list is a list of 'struct bypass_master' that associates

Well, can't you have it in netdev priv?


>'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>NULL for 3-netdev model.

I see :(


>
>
>> 
>> 
>> 
>> 
>> > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > +			*ops = rcu_dereference(bypass_master->ops);
>> I don't see how rcu_dereference is ok here.
>> 1) I don't see rcu_read_lock taken
>> 2) Looks like bypass_master->ops has the same value across the whole
>>     existence.
>
>We hold rtnl_lock(), so I think I need to change this to rtnl_dereference().
>Yes. ops doesn't change.

If it does not change, you can just access it directly.


>
>> 
>> 
>> > +			spin_unlock(&bypass_lock);
>> > +			return bypass_netdev;
>> > +		}
>> > +	}
>> > +	spin_unlock(&bypass_lock);
>> > +	return NULL;
>> > +}
>> > +
>> > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	int ret, orig_mtu;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> For master, could you use word "master" in the variables so it is clear?
>> Also, "dev" is fine instead of "netdev".
>> Something like "bpmaster_dev"
>
>bypass_master is of type struct bypass_master; bypass_netdev is of type struct net_device.

I was trying to point out that "bypass_netdev" represents a "master"
netdev, yet its name does not say master. That is why I suggested
"bpmaster_dev".


>I can change all _netdev suffixes to _dev to make the names shorter.

ok.


>
>
>> 
>> 
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > +					bypass_ops);
>> > +	if (ret != 0)
>> 	Just "if (ret)" will do. You have this in more places.
>
>OK.
>
>
>> 
>> 
>> > +		goto done;
>> > +
>> > +	ret = netdev_rx_handler_register(slave_netdev,
>> > +					 bypass_ops ? bypass_ops->handle_frame :
>> > +					 bypass_handle_frame, bypass_netdev);
>> > +	if (ret != 0) {
>> > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > +			   ret);
>> > +		goto done;
>> > +	}
>> > +
>> > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > +	if (ret != 0) {
>> > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > +			   bypass_netdev->name, ret);
>> > +		goto upper_link_failed;
>> > +	}
>> > +
>> > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > +
>> > +	if (netif_running(bypass_netdev)) {
>> > +		ret = dev_open(slave_netdev);
>> > +		if (ret && (ret != -EBUSY)) {
>> > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > +				   slave_netdev->name, ret);
>> > +			goto err_interface_up;
>> > +		}
>> > +	}
>> > +
>> > +	/* Align MTU of slave with master */
>> > +	orig_mtu = slave_netdev->mtu;
>> > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > +	if (ret != 0) {
>> > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > +			   slave_netdev->name, bypass_netdev->mtu);
>> > +		goto err_set_mtu;
>> > +	}
>> > +
>> > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > +	if (ret != 0)
>> > +		goto err_join;
>> > +
>> > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	goto done;
>> > +
>> > +err_join:
>> > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > +err_set_mtu:
>> > +	dev_close(slave_netdev);
>> > +err_interface_up:
>> > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > +upper_link_failed:
>> > +	netdev_rx_handler_unregister(slave_netdev);
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > +				       struct net_device *bypass_netdev,
>> > +				       struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct net_device *backup_netdev, *active_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_pre_unregister)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > +							bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > +		return -EINVAL;
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > +				struct net_device *bypass_netdev,
>> > +				struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct net_device *backup_netdev, *active_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_release)
>> > +			return -EINVAL;
>> I think it would be good to make the API to the driver more strict and
>> have a separate set of ops for "active" and "backup" netdevices.
>> That should stop people thinking about extending this to more slaves in
>> the future.
>
>We have checks in slave_pre_register() that allow only 1 'backup' and 1
>'active' slave.

I'm very well aware of that. I just thought that explicit ops for the
two slaves would make this more clear.


>
>
>> 
>> 
>> 
>> > +
>> > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev == backup_netdev) {
>> > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > +	} else {
>> > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > +		if (backup_netdev) {
>> > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > +		}
>> > +	}
>> > +
>> > +	dev_put(slave_netdev);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	int ret;
>> > +
>> > +	if (!netif_is_bypass_slave(slave_netdev))
>> > +		goto done;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > +					  bypass_ops);
>> > +	if (ret != 0)
>> > +		goto done;
>> > +
>> > +	netdev_rx_handler_unregister(slave_netdev);
>> > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > +
>> > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > +		    slave_netdev->name);
>> > +
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > +
>> > +static bool bypass_xmit_ready(struct net_device *dev)
>> > +{
>> > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > +}
>> > +
>> > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (!netif_is_bypass_slave(slave_netdev))
>> > +		goto done;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_link_change)
>> > +			goto done;
>> > +
>> > +		return bypass_ops->slave_link_change(slave_netdev,
>> > +						     bypass_netdev);
>> > +	}
>> > +
>> > +	if (!netif_running(bypass_netdev))
>> > +		return 0;
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > +		goto done;
>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> above is enough.
>
>I think we need this check to not allow events from a slave that is not
>attached to this master but has the same MAC.

Why do we need such events? Seems wrong to me. Consider:

bp1      bp2
a1 b1    a2 b2


a1 and a2 have the same MAC, and bp1 and bp2 have the same MAC.
Now bypass_master_get_bymac() will always return either bp1 or bp2,
depending on the order of creation.
Let's say it returns bp1. Then when we have an event for a2,
bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.


You cannot use bypass_master_get_bymac() here.
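
The bp1/bp2 case above can be reproduced in a small userspace model. This
is only a sketch: struct model_dev, master_get_bymac() and
master_get_bylink() are hypothetical stand-ins invented for illustration,
not kernel structures or functions. It shows that a first-match search by
MAC pairs a2 with bp1, while following a per-slave link recorded at join
time yields bp2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical userspace stand-in for a net_device; not the kernel type. */
struct model_dev {
	const char *name;
	unsigned char mac[ETH_ALEN];
	struct model_dev *master;	/* upper device recorded when the slave joined */
};

/* Models a lookup like bypass_master_get_bymac(): the first master in the
 * list whose MAC matches wins, regardless of which slave is asking.
 */
static struct model_dev *master_get_bymac(struct model_dev **masters, size_t n,
					  const unsigned char *mac)
{
	size_t i;

	assert(masters != NULL && mac != NULL);
	for (i = 0; i < n; i++)
		if (memcmp(masters[i]->mac, mac, ETH_ALEN) == 0)
			return masters[i];
	return NULL;
}

/* Models the alternative: follow the link stored on the slave itself. */
static struct model_dev *master_get_bylink(const struct model_dev *slave)
{
	assert(slave != NULL);
	return slave->master;
}
```

With bp1 created before bp2 and both sharing one MAC, master_get_bymac()
returns bp1 for a2, whereas master_get_bylink() returns bp2. Which kernel
mechanism should carry that per-slave link is not settled in this thread.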



>
>> 
>> 
>> > +
>> > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > +		netif_carrier_on(bypass_netdev);
>> > +		netif_tx_wake_all_queues(bypass_netdev);
>> > +	} else {
>> > +		netif_carrier_off(bypass_netdev);
>> > +		netif_tx_stop_all_queues(bypass_netdev);
>> > +	}
>> > +
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > +{
>> > +	/* Skip parent events */
>> > +	if (netif_is_bypass_master(dev))
>> > +		return false;
>> > +
>> > +	/* Avoid non-Ethernet type devices */
>> > +	if (dev->type != ARPHRD_ETHER)
>> > +		return false;
>> > +
>> > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > +	if (is_vlan_dev(dev))
>> > +		return false;
>> > +
>> > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> Yeah, this is certainly incorrect. For one thing, you should be using the
>> helper netif_is_bond_master().
>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> 
>> You need to do this by whitelisting, not by blacklisting. You need
>> to whitelist VF devices. My port flavours patchset might help with this.
>
>Maybe I can use the netdev_has_lower_dev() helper to make sure that the slave

I don't see such a function in the code.


>device is not an upper dev.
>Can you point to your port flavours patchset? Is it upstream?

I sent an RFC a couple of weeks ago:
[patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation


>
>> 
>> 
>> > +		return false;
>> > +
>> > +	return true;
>> > +}
>> > +
>> > +static int
>> > +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> > +{
>> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> > +
>> > +	if (!bypass_validate_event_dev(event_dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	switch (event) {
>> > +	case NETDEV_REGISTER:
>> > +		return bypass_slave_register(event_dev);
>> > +	case NETDEV_UNREGISTER:
>> > +		return bypass_slave_unregister(event_dev);
>> > +	case NETDEV_UP:
>> > +	case NETDEV_DOWN:
>> > +	case NETDEV_CHANGE:
>> > +		return bypass_slave_link_change(event_dev);
>> > +	default:
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +}
>> > +
>> > +static struct notifier_block bypass_notifier = {
>> > +	.notifier_call = bypass_event,
>> > +};
>> > +
>> > +int bypass_open(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *active_netdev, *backup_netdev;
>> > +	int err;
>> > +
>> > +	netif_carrier_off(dev);
>> > +	netif_tx_wake_all_queues(dev);
>> > +
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (active_netdev) {
>> > +		err = dev_open(active_netdev);
>> > +		if (err)
>> > +			goto err_active_open;
>> > +	}
>> > +
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +	if (backup_netdev) {
>> > +		err = dev_open(backup_netdev);
>> > +		if (err)
>> > +			goto err_backup_open;
>> > +	}
>> > +
>> > +	return 0;
>> > +
>> > +err_backup_open:
>> > +	dev_close(active_netdev);
>> > +err_active_open:
>> > +	netif_tx_disable(dev);
>> > +	return err;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_open);
>> > +
>> > +int bypass_close(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *vi = netdev_priv(dev);
>> This should probably be "bi".
>
>Yes.
>
>
>> 
>> 
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	netif_tx_disable(dev);
>> > +
>> > +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> > +	if (slave_netdev)
>> > +		dev_close(slave_netdev);
>> > +
>> > +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> > +	if (slave_netdev)
>> > +		dev_close(slave_netdev);
>> > +
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_close);
>> > +
>> > +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> > +{
>> > +	atomic_long_inc(&dev->tx_dropped);
>> > +	dev_kfree_skb_any(skb);
>> > +	return NETDEV_TX_OK;
>> > +}
>> > +
>> > +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> If you rename the other variable to "bpmaster_dev", it would be nice to
>> rename this to bpinfo or something more descriptive. "bi" is too short
>> to know what that is right away.
>
>Will rename bypass_netdev to bypass_dev. bypass indicates that it is
>an upper master dev.
>
>
>> 
>> 
>> > +	struct net_device *xmit_dev;
>> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.
>
>OK.
>
>
>> 
>> 
>> 
>> > +
>> > +	/* Try xmit via active netdev followed by backup netdev */
>> > +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> > +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> > +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> > +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> > +			return bypass_drop_xmit(skb, dev);
>> > +	}
>> > +
>> > +	skb->dev = xmit_dev;
>> > +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> > +
>> > +	return dev_queue_xmit(skb);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> > +
>> > +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> > +			void *accel_priv, select_queue_fallback_t fallback)
>> > +{
>> > +	/* This helper function exists to help dev_pick_tx get the correct
>> > +	 * destination queue.  Using a helper function skips a call to
>> > +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> > +	 * way down to the bonding driver.
>> > +	 */
>> > +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> > +
>> > +	/* Save the original txq to restore before passing to the driver */
>> > +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> > +
>> > +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> > +		do {
>> > +			txq -= dev->real_num_tx_queues;
>> > +		} while (txq >= dev->real_num_tx_queues);
>> > +	}
>> > +
>> > +	return txq;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_select_queue);
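
For reference, the do/while loop in bypass_select_queue() reduces the
recorded rx queue index into the range [0, real_num_tx_queues). The
userspace sketch below isolates just that arithmetic and compares it with
a plain modulo. The function names are hypothetical, and the sketch
assumes the queue count is nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Fold txq into [0, real_num_tx_queues) by repeated subtraction, mirroring
 * the loop in the patch. No division is needed when txq is already in range.
 */
static uint16_t fold_txq_subtract(uint16_t txq, uint16_t real_num_tx_queues)
{
	assert(real_num_tx_queues > 0);
	while (txq >= real_num_tx_queues)
		txq -= real_num_tx_queues;
	return txq;
}

/* The same reduction written as a modulo, for comparison. */
static uint16_t fold_txq_modulo(uint16_t txq, uint16_t real_num_tx_queues)
{
	assert(real_num_tx_queues > 0);
	return txq % real_num_tx_queues;
}
```

Both forms give the same result for every input. The subtraction form
only does extra work in the unlikely case that the recorded index exceeds
the number of tx queues on the bypass device.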
>> > +
>> > +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> > + * that some drivers can provide 32bit values only.
>> > + */
>> > +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> > +			      const struct rtnl_link_stats64 *_new,
>> > +			      const struct rtnl_link_stats64 *_old)
>> > +{
>> > +	const u64 *new = (const u64 *)_new;
>> > +	const u64 *old = (const u64 *)_old;
>> > +	u64 *res = (u64 *)_res;
>> > +	int i;
>> > +
>> > +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> > +		u64 nv = new[i];
>> > +		u64 ov = old[i];
>> > +		s64 delta = nv - ov;
>> > +
>> > +		/* detects if this particular field is 32bit only */
>> > +		if (((nv | ov) >> 32) == 0)
>> > +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> > +
>> > +		/* filter anomalies, some drivers reset their stats
>> > +		 * at down/up events.
>> > +		 */
>> > +		if (delta > 0)
>> > +			res[i] += delta;
>> > +	}
>> > +}
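
The per-field logic in bypass_fold_stats() can be checked in isolation.
The sketch below is a userspace rendering of a single loop iteration;
fold_counter() is a hypothetical name, and the fixed-width types stand in
for the kernel's u64/s64/u32/s32. Like the original, it relies on the
usual two's-complement behaviour when converting unsigned values to
signed ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the growth of one counter since the previous snapshot to *res.
 * nv is the new sample and ov is the old one.
 */
static void fold_counter(uint64_t *res, uint64_t nv, uint64_t ov)
{
	int64_t delta = (int64_t)(nv - ov);

	assert(res != NULL);

	/* If neither sample uses the upper 32 bits, treat the counter as a
	 * 32-bit one and take the difference modulo 2^32, so that a wrap
	 * from near 0xffffffff back to a small value still counts as growth.
	 */
	if (((nv | ov) >> 32) == 0)
		delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

	/* A non-positive delta is taken to be a counter reset and is ignored. */
	if (delta > 0)
		*res += (uint64_t)delta;
}
```

A 32-bit counter that wraps from 0xfffffff0 to 10 contributes 26, a
counter that drops from 1000 to 5 contributes nothing, and a genuine
64-bit counter is handled by the plain subtraction.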
>> > +
>> > +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> You can WARN_ON and return in case the dev is not a bypass master, just
>> to catch buggy drivers. Same with the other helpers.
>
>I can make this static and not export this helper as well as all
>bypass_netdev ops.

Ok.


>
>> 
>> 
>> > +	const struct rtnl_link_stats64 *new;
>> > +	struct rtnl_link_stats64 temp;
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	spin_lock(&bi->stats_lock);
>> > +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> > +
>> > +	rcu_read_lock();
>> > +
>> > +	slave_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (slave_netdev) {
>> > +		new = dev_get_stats(slave_netdev, &temp);
>> > +		bypass_fold_stats(stats, new, &bi->active_stats);
>> > +		memcpy(&bi->active_stats, new, sizeof(*new));
>> > +	}
>> > +
>> > +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (slave_netdev) {
>> > +		new = dev_get_stats(slave_netdev, &temp);
>> > +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> > +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> > +	}
>> > +
>> > +	rcu_read_unlock();
>> > +
>> > +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> > +	spin_unlock(&bi->stats_lock);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> > +
>> > +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *active_netdev, *backup_netdev;
>> > +	int ret = 0;
>> Pointless initialization.
>> 
>> 
>> > +
>> > +	active_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (active_netdev) {
>> > +		ret = dev_set_mtu(active_netdev, new_mtu);
>> > +		if (ret)
>> > +			return ret;
>> > +	}
>> > +
>> > +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (backup_netdev) {
>> > +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> > +		if (ret) {
>> > +			dev_set_mtu(active_netdev, dev->mtu);
>> > +			return ret;
>> > +		}
>> > +	}
>> > +
>> > +	dev->mtu = new_mtu;
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> > +
>> > +void bypass_set_rx_mode(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	rcu_read_lock();
>> > +
>> > +	slave_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (slave_netdev) {
>> > +		dev_uc_sync_multiple(slave_netdev, dev);
>> > +		dev_mc_sync_multiple(slave_netdev, dev);
>> > +	}
>> > +
>> > +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (slave_netdev) {
>> > +		dev_uc_sync_multiple(slave_netdev, dev);
>> > +		dev_mc_sync_multiple(slave_netdev, dev);
>> > +	}
>> > +
>> > +	rcu_read_unlock();
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> > +
>> > +static const struct net_device_ops bypass_netdev_ops = {
>> > +	.ndo_open		= bypass_open,
>> > +	.ndo_stop		= bypass_close,
>> > +	.ndo_start_xmit		= bypass_start_xmit,
>> > +	.ndo_select_queue	= bypass_select_queue,
>> > +	.ndo_get_stats64	= bypass_get_stats,
>> > +	.ndo_change_mtu		= bypass_change_mtu,
>> > +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> > +	.ndo_validate_addr	= eth_validate_addr,
>> > +	.ndo_features_check	= passthru_features_check,
>> > +};
>> > +
>> > +#define BYPASS_DRV_NAME "bypass"
>> > +#define BYPASS_DRV_VERSION "0.1"
>> > +
>> > +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> > +				       struct ethtool_drvinfo *drvinfo)
>> > +{
>> > +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> > +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> > +}
>> > +
>> > +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> > +				      struct ethtool_link_ksettings *cmd)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> > +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> > +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> > +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> > +			cmd->base.port = PORT_OTHER;
>> > +			cmd->base.speed = SPEED_UNKNOWN;
>> > +
>> > +			return 0;
>> > +		}
>> > +	}
>> > +
>> > +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> > +
>> > +static const struct ethtool_ops bypass_ethtool_ops = {
>> > +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> > +	.get_link               = ethtool_op_get_link,
>> > +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> > +};
>> > +
>> > +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> > +{
>> > +	struct net *net = dev_net(bypass_netdev);
>> > +	struct net_device *dev;
>> > +
>> > +	rtnl_lock();
>> > +	for_each_netdev(net, dev) {
>> > +		if (dev == bypass_netdev)
>> > +			continue;
>> > +		if (!bypass_validate_event_dev(dev))
>> > +			continue;
>> > +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> > +			bypass_slave_register(dev);
>> > +	}
>> > +	rtnl_unlock();
>> > +}
>> > +
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master)
>> > +{
>> > +	struct bypass_master *bypass_master;
>> > +
>> > +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> > +	if (!bypass_master)
>> > +		return -ENOMEM;
>> > +
>> > +	rcu_assign_pointer(bypass_master->ops, ops);
>> > +	dev_hold(dev);
>> > +	dev->priv_flags |= IFF_BYPASS;
>> > +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> > +	spin_unlock(&bypass_lock);
>> > +
>> > +	bypass_register_existing_slave(dev);
>> > +
>> > +	*pbypass_master = bypass_master;
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_register);
>> > +
>> > +void bypass_master_unregister(struct bypass_master *bypass_master)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +
>> > +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +
>> > +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> > +	dev_put(bypass_netdev);
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_del(&bypass_master->list);
>> > +	spin_unlock(&bypass_lock);
>> > +
>> > +	kfree(bypass_master);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> > +
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master)
>> > +{
>> > +	struct device *dev = backup_netdev->dev.parent;
>> > +	struct net_device *bypass_netdev;
>> > +	int err;
>> > +
>> > +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> > +	 * that most devices being bonded won't have too many queues.
>> > +	 */
>> > +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> > +	if (!bypass_netdev) {
>> > +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> > +		return -ENOMEM;
>> > +	}
>> > +
>> > +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> > +	SET_NETDEV_DEV(bypass_netdev, dev);
>> > +
>> > +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> > +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> > +
>> > +	/* Initialize the device options */
>> > +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> > +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> > +				       IFF_TX_SKB_SHARING);
>> > +
>> > +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> > +	bypass_netdev->features |= NETIF_F_LLTX;
>> > +
>> > +	/* Don't allow bypass devices to change network namespaces. */
>> > +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> > +
>> > +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> > +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> > +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> > +
>> > +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> > +	bypass_netdev->features |= bypass_netdev->hw_features;
>> > +
>> > +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> > +	       bypass_netdev->addr_len);
>> > +
>> > +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > +
>> > +	err = register_netdev(bypass_netdev);
>> > +	if (err < 0) {
>> > +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> > +		goto err_register_netdev;
>> > +	}
>> > +
>> > +	netif_carrier_off(bypass_netdev);
>> > +
>> > +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> > +	if (err < 0)
>> just "if (err)" would do.
>
>OK
>
>> 
>> 
>> > +		goto err_bypass;
>> > +
>> > +	return 0;
>> > +
>> > +err_bypass:
>> > +	unregister_netdev(bypass_netdev);
>> > +err_register_netdev:
>> > +	free_netdev(bypass_netdev);
>> > +
>> > +	return err;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_create);
>> > +
>> > +void bypass_master_destroy(struct bypass_master *bypass_master)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct net_device *slave_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (!bypass_master)
>> > +		return;
>> > +
>> > +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +	bi = netdev_priv(bypass_netdev);
>> > +
>> > +	netif_device_detach(bypass_netdev);
>> > +
>> > +	rtnl_lock();
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (slave_netdev)
>> > +		bypass_slave_unregister(slave_netdev);
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> > +	if (slave_netdev)
>> > +		bypass_slave_unregister(slave_netdev);
>> > +
>> > +	bypass_master_unregister(bypass_master);
>> > +
>> > +	unregister_netdevice(bypass_netdev);
>> > +
>> > +	rtnl_unlock();
>> > +
>> > +	free_netdev(bypass_netdev);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> > +
>> > +static __init int
>> > +bypass_init(void)
>> > +{
>> > +	register_netdevice_notifier(&bypass_notifier);
>> > +
>> > +	return 0;
>> > +}
>> > +module_init(bypass_init);
>> > +
>> > +static __exit
>> > +void bypass_exit(void)
>> > +{
>> > +	unregister_netdevice_notifier(&bypass_notifier);
>> > +}
>> > +module_exit(bypass_exit);
>> > +
>> > +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> > +MODULE_LICENSE("GPL v2");
>> > -- 
>> > 2.14.3
>> > 
>


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]       ` <20180418092515.GB1989@nanopsycho>
@ 2018-04-18 18:43         ` Samudrala, Sridhar
       [not found]         ` <dd0d53f0-f5da-cb7d-f8f6-d0c8245eb3cf@intel.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 18:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> This provides a generic interface for paravirtual drivers to listen
>>>> for netdev register/unregister/link change events from pci ethernet
>>>> devices with the same MAC and takeover their datapath. The notifier and
>>>> event handling code is based on the existing netvsc implementation.
>>>>
>>>> It exposes 2 sets of interfaces to the paravirtual drivers.
>>>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>>>> master netdev is created. The paravirtual driver registers each bypass
>>>> instance along with a set of ops to manage the slave events.
>>>>       bypass_master_register()
>>>>       bypass_master_unregister()
>>>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>>>> the bypass module provides interfaces to create/destroy additional master
>>>> netdev and all the slave events are managed internally.
>>>>        bypass_master_create()
>>>>        bypass_master_destroy()
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>> include/linux/netdevice.h |  14 +
>>>> include/net/bypass.h      |  96 ++++++
>>>> net/Kconfig               |  18 +
>>>> net/core/Makefile         |   1 +
>>>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 973 insertions(+)
>>>> create mode 100644 include/net/bypass.h
>>>> create mode 100644 net/core/bypass.c
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index cf44503ea81a..587293728f70 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>>>> 	IFF_PHONY_HEADROOM		= 1<<24,
>>>> 	IFF_MACSEC			= 1<<25,
>>>> 	IFF_NO_RX_HANDLER		= 1<<26,
>>>> +	IFF_BYPASS			= 1 << 27,
>>>> +	IFF_BYPASS_SLAVE		= 1 << 28,
>>> I wonder, why you don't follow the existing coding style... Also, please
>>> add these to into the comment above.
>> To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> to the existing coding style to be consistent.
> Please do.
>
>
>>>
>>>> };
>>>>
>>>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>>>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>>>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>>> #define IFF_MACSEC			IFF_MACSEC
>>>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>>>> +#define IFF_BYPASS			IFF_BYPASS
>>>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>>>
>>>> /**
>>>>    *	struct net_device - The DEVICE structure.
>>>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>>> }
>>>>
>>>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS;
>>>> +}
>>>> +
>>>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>>>> +}
>>>> +
>>>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>>> static inline void netif_keep_dst(struct net_device *dev)
>>>> {
>>>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>>>> new file mode 100644
>>>> index 000000000000..86b02cb894cf
>>>> --- /dev/null
>>>> +++ b/include/net/bypass.h
>>>> @@ -0,0 +1,96 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2018, Intel Corporation. */
>>>> +
>>>> +#ifndef _NET_BYPASS_H
>>>> +#define _NET_BYPASS_H
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +
>>>> +struct bypass_ops {
>>>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>>>> +				  struct net_device *bypass_netdev);
>>>> +	int (*slave_join)(struct net_device *slave_netdev,
>>>> +			  struct net_device *bypass_netdev);
>>>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>>>> +				    struct net_device *bypass_netdev);
>>>> +	int (*slave_release)(struct net_device *slave_netdev,
>>>> +			     struct net_device *bypass_netdev);
>>>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>>>> +				 struct net_device *bypass_netdev);
>>>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>>>> +};
>>>> +
>>>> +struct bypass_master {
>>>> +	struct list_head list;
>>>> +	struct net_device __rcu *bypass_netdev;
>>>> +	struct bypass_ops __rcu *ops;
>>>> +};
>>>> +
>>>> +/* bypass state */
>>>> +struct bypass_info {
>>>> +	/* passthru netdev with same MAC */
>>>> +	struct net_device __rcu *active_netdev;
>>> You still use "active"/"backup" names which is highly misleading as
>>> it has completely different meaning that in bond for example.
>>> I noted that in my previous review already. Please change it.
>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> matches with the BACKUP feature bit we are adding to virtio_net.
> I think that "backup" is also misleading. Both "active" and "backup"
> mean a *state* of slaves. This should be named differently.
>
>
>
>> With regards to alternate names for 'active', you suggested 'stolen', but i
>> am not too happy with it.
>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> No. The netdev could be any netdevice. It does not have to be a "VF".
> I think "stolen" is quite appropriate since it describes the modus
> operandi. The bypass master steals some netdevice according to some
> match.
>
> But I don't insist on "stolen". Just sounds right.

We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, so I think
the 'backup' name is consistent.

The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.

I will wait for suggestions over the next day or two. If I don't get any, I will go
with 'stolen'.

<snip>


> +
> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> +						  struct bypass_ops **ops)
> +{
> +	struct bypass_master *bypass_master;
> +	struct net_device *bypass_netdev;
> +
> +	spin_lock(&bypass_lock);
> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>> As I wrote the last time, you don't need this list, spinlock.
>>> You can do just something like:
>>>           for_each_net(net) {
>>>                   for_each_netdev(net, dev) {
>>> 			if (netif_is_bypass_master(dev)) {
>> This function returns the upper netdev as well as the ops associated
>> with that netdev.
>> bypass_master_list is a list of 'struct bypass_master' that associates
> Well, can't you have it in netdev priv?

We cannot do this for the 2-netdev model, as no bypass_netdev is created there.
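
To make the two cases concrete, here is a minimal user-space sketch of how a
single entry can serve both models. It is not the patch code: model_dev,
model_ops, model_master and the helper names are hypothetical stand-ins, and
only the branching mirrors bypass_slave_pre_unregister() from the patch. With
non-NULL ops (2-netdev model) the driver callback decides; with NULL ops
(3-netdev model) the slave is checked against the internally tracked pair.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct net_device. */
struct model_dev {
	const char *name;
};

/* Hypothetical, trimmed-down version of struct bypass_ops. */
struct model_ops {
	int (*slave_pre_unregister)(struct model_dev *slave,
				    struct model_dev *master);
};

/* One entry pairing a master with its ops; ops is NULL in the 3-netdev model. */
struct model_master {
	struct model_dev *dev;
	const struct model_ops *ops;
	struct model_dev *active;	/* used only when ops is NULL */
	struct model_dev *backup;	/* used only when ops is NULL */
};

/* Example driver callback for the 2-netdev case: accepts every slave. */
static int model_driver_pre_unregister(struct model_dev *slave,
				       struct model_dev *master)
{
	(void)slave;
	(void)master;
	return 0;
}

/* Dispatch to the driver if ops are present, else validate internally. */
static int model_pre_unregister(struct model_master *m,
				struct model_dev *slave)
{
	assert(m != NULL);

	if (m->ops) {
		if (!m->ops->slave_pre_unregister)
			return -EINVAL;
		return m->ops->slave_pre_unregister(slave, m->dev);
	}

	if (slave != m->active && slave != m->backup)
		return -EINVAL;

	return 0;
}
```

In this model the ops pointer has to live next to the master netdev in a
separate entry, which is the role bypass_master_list plays in the patch.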

>
>
>> 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> NULL for 3-netdev model.
> I see :(
>
>
>>
>>>
>>>
>>>
>>>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>>>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>>>> +			*ops = rcu_dereference(bypass_master->ops);
>>> I don't see how rcu_dereference is ok here.
>>> 1) I don't see rcu_read_lock taken
>>> 2) Looks like bypass_master->ops has the same value across the whole
>>>      existence.
>> We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> Yes. ops doesn't change.
> If it does not change, you can just access it directly.
>
>
>>>
>>>> +			spin_unlock(&bypass_lock);
>>>> +			return bypass_netdev;
>>>> +		}
>>>> +	}
>>>> +	spin_unlock(&bypass_lock);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int bypass_slave_register(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret, orig_mtu;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>> For master, could you use word "master" in the variables so it is clear?
>>> Also, "dev" is fine instead of "netdev".
>>> Something like "bpmaster_dev"
>> bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> I was trying to point out, that "bypass_netdev" represents a "master"
> netdev, yet it does not say master. That is why I suggested
> "bpmaster_dev"
>
>
>> I can change all _netdev suffixes to _dev to make the names shorter.
> ok.
>
>
>>
>>>
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>>>> +					bypass_ops);
>>>> +	if (ret != 0)
>>> 	Just "if (ret)" will do. You have this on more places.
>> OK.
>>
>>
>>>
>>>> +		goto done;
>>>> +
>>>> +	ret = netdev_rx_handler_register(slave_netdev,
>>>> +					 bypass_ops ? bypass_ops->handle_frame :
>>>> +					 bypass_handle_frame, bypass_netdev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto done;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>>>> +			   bypass_netdev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>>>> +
>>>> +	if (netif_running(bypass_netdev)) {
>>>> +		ret = dev_open(slave_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>>>> +				   slave_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Align MTU of slave with master */
>>>> +	orig_mtu = slave_netdev->mtu;
>>>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>>>> +			   slave_netdev->name, bypass_netdev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto err_join;
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	goto done;
>>>> +
>>>> +err_join:
>>>> +	dev_set_mtu(slave_netdev, orig_mtu);
>>>> +err_set_mtu:
>>>> +	dev_close(slave_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>>>> +				       struct net_device *bypass_netdev,
>>>> +				       struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_pre_unregister)
>>>> +			return -EINVAL;
>>>> +
>>>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>>>> +							bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int bypass_slave_release(struct net_device *slave_netdev,
>>>> +				struct net_device *bypass_netdev,
>>>> +				struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_release)
>>>> +			return -EINVAL;
>>> I think it would be good to make the API to the driver more strict and
>>> have a separate set of ops for "active" and "backup" netdevices.
>>> That should stop people thinking about extending this to more slaves in
>>> the future.
>> We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> 'active' slave.
> I'm very well aware of that. I just thought that explicit ops for the
> two slaves would make this more clear.
>
>
>>
>>>
>>>
>>>> +
>>>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev == backup_netdev) {
>>>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>>>> +		if (backup_netdev) {
>>>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>>>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>>>> +					  bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto done;
>>>> +
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +
>>>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>>>> +
>>>> +static bool bypass_xmit_ready(struct net_device *dev)
>>>> +{
>>>> +	return netif_running(dev) && netif_carrier_ok(dev);
>>>> +}
>>>> +
>>>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_link_change)
>>>> +			goto done;
>>>> +
>>>> +		return bypass_ops->slave_link_change(slave_netdev,
>>>> +						     bypass_netdev);
>>>> +	}
>>>> +
>>>> +	if (!netif_running(bypass_netdev))
>>>> +		return 0;
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		goto done;
>>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>>> above is enough.
>> I think we need this check to not allow events from a slave that is not
>> attached to this master but has the same MAC.
> Why do we need such events? Seems wrong to me.

We want to ignore events from a netdev that is misconfigured with the same MAC as
a bypass setup.

>   Consider:
>
> bp1      bp2
> a1 b1    a2 b2
>
>
> a1 and a2 have the same mac and bp1 and bp2 have the same mac.

We should not have 2 bypass configs with the same MAC.
I need to add a check in bypass_master_register() to prevent this.

The above check covers the case where we have
bp1(a1, b1) with mac1
and a2 is misconfigured with mac1. We do not want a2 link events to update bp1.
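
A minimal user-space sketch of that case, assuming simplified stand-ins for the
kernel structures (model_dev, model_master and both helpers are hypothetical
names, not part of the patch). The lookup by MAC finds bp1 for a2 as well, so
only the pointer comparison against the tracked slaves keeps a2's events out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical user-space stand-in for struct net_device. */
struct model_dev {
	const char *name;
	unsigned char perm_addr[ETH_ALEN];
};

/* Hypothetical master entry tracking its two slaves. */
struct model_master {
	struct model_dev *dev;		/* the bypass master */
	struct model_dev *active;	/* 'active' slave, may be NULL */
	struct model_dev *backup;	/* 'backup' slave, may be NULL */
};

/* Return the first master whose permanent MAC matches, or NULL. */
static struct model_master *model_get_bymac(struct model_master *masters,
					    size_t n,
					    const unsigned char *mac)
{
	size_t i;

	assert(mac != NULL);

	for (i = 0; i < n; i++)
		if (memcmp(masters[i].dev->perm_addr, mac, ETH_ALEN) == 0)
			return &masters[i];

	return NULL;
}

/* True only if the event source is one of this master's own slaves. */
static bool model_event_is_ours(const struct model_master *m,
				const struct model_dev *slave)
{
	return m && (slave == m->active || slave == m->backup);
}
```

Because the lookup returns the first match, two masters with the same MAC would
still be ambiguous in this model, which is why the registration-time check is
needed as well.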

> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> the order of creation.
> Let's say it will return bp1. Then when we have event for a2, the
> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>
>
> You cannot use bypass_master_get_bymac() here.
>
>
>
>>>
>>>> +
>>>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>>>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>>>> +		netif_carrier_on(bypass_netdev);
>>>> +		netif_tx_wake_all_queues(bypass_netdev);
>>>> +	} else {
>>>> +		netif_carrier_off(bypass_netdev);
>>>> +		netif_tx_stop_all_queues(bypass_netdev);
>>>> +	}
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static bool bypass_validate_event_dev(struct net_device *dev)
>>>> +{
>>>> +	/* Skip parent events */
>>>> +	if (netif_is_bypass_master(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (dev->type != ARPHRD_ETHER)
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as VF */
>>>> +	if (is_vlan_dev(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>> helpers netif_is_bond_master().
>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>
>>> You need to do it not by blacklisting, but with whitelisting. You need
>>> to whitelist VF devices. My port flavours patchset might help with this.
>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
> I don't see such function in the code.

It is netdev_has_any_lower_dev(). I need to export it.
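
Roughly what I have in mind, as a user-space sketch rather than kernel code.
model_dev, n_lower and the model_* helpers are hypothetical; in the kernel the
test would be a call to netdev_has_any_lower_dev() from
bypass_validate_event_dev(), once that helper is exported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ARPHRD_ETHER 1	/* same value as the kernel's ARPHRD_ETHER */

/* Hypothetical user-space stand-in for struct net_device. */
struct model_dev {
	const char *name;
	unsigned short type;	/* hardware type */
	bool is_bypass_master;
	size_t n_lower;		/* number of lower devices linked below */
};

/* Models netdev_has_any_lower_dev(): true for a device stacked on others. */
static bool model_has_any_lower_dev(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->n_lower > 0;
}

/* Models bypass_validate_event_dev() using the lower-device test. */
static bool model_validate_event_dev(const struct model_dev *dev)
{
	/* Skip parent events */
	if (dev->is_bypass_master)
		return false;

	/* Avoid non-Ethernet type devices */
	if (dev->type != MODEL_ARPHRD_ETHER)
		return false;

	/* Avoid any upper device (vlan, bond, team, ...) with the same MAC */
	if (model_has_any_lower_dev(dev))
		return false;

	return true;
}
```

This is still a blacklist in spirit: an upper device that has no lower devices
attached yet would pass the test in this model.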

>
>
>> device is not an upper dev.
>> Can you point to your port flavours patchset? Is it upstream?
> I sent rfc couple of weeks ago:
> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]         ` <dd0d53f0-f5da-cb7d-f8f6-d0c8245eb3cf@intel.com>
@ 2018-04-18 19:13           ` Jiri Pirko
       [not found]           ` <20180418191315.GA1922@nanopsycho>
  1 sibling, 0 replies; 47+ messages in thread
From: Jiri Pirko @ 2018-04-18 19:13 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > > 
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > >       bypass_master_register()
>> > > >       bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > > > the bypass module provides interfaces to create/destroy additional master
>> > > > netdev and all the slave events are managed internally.
>> > > >        bypass_master_create()
>> > > >        bypass_master_destroy()
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h |  14 +
>> > > > include/net/bypass.h      |  96 ++++++
>> > > > net/Kconfig               |  18 +
>> > > > net/core/Makefile         |   1 +
>> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > > 
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > > > 	IFF_MACSEC			= 1<<25,
>> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > > > +	IFF_BYPASS			= 1 << 27,
>> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> > to the existing coding style to be consistent.
>> Please do.
>> 
>> 
>> > > 
>> > > > };
>> > > > 
>> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC			IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASS			IFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > > > 
>> > > > /**
>> > > >    *	struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > > 
>> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > > > +}
>> > > > +
>> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > > > static inline void netif_keep_dst(struct net_device *dev)
>> > > > {
>> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > > > new file mode 100644
>> > > > index 000000000000..86b02cb894cf
>> > > > --- /dev/null
>> > > > +++ b/include/net/bypass.h
>> > > > @@ -0,0 +1,96 @@
>> > > > +// SPDX-License-Identifier: GPL-2.0
>> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> > > > +
>> > > > +#ifndef _NET_BYPASS_H
>> > > > +#define _NET_BYPASS_H
>> > > > +
>> > > > +#include <linux/netdevice.h>
>> > > > +
>> > > > +struct bypass_ops {
>> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > > > +				  struct net_device *bypass_netdev);
>> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> > > > +			  struct net_device *bypass_netdev);
>> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > > > +				    struct net_device *bypass_netdev);
>> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> > > > +			     struct net_device *bypass_netdev);
>> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > > > +				 struct net_device *bypass_netdev);
>> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > > > +};
>> > > > +
>> > > > +struct bypass_master {
>> > > > +	struct list_head list;
>> > > > +	struct net_device __rcu *bypass_netdev;
>> > > > +	struct bypass_ops __rcu *ops;
>> > > > +};
>> > > > +
>> > > > +/* bypass state */
>> > > > +struct bypass_info {
>> > > > +	/* passthru netdev with same MAC */
>> > > > +	struct net_device __rcu *active_netdev;
>> > > You still use "active"/"backup" names which is highly misleading as
>> > > it has completely different meaning that in bond for example.
>> > > I noted that in my previous review already. Please change it.
>> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > matches with the BACKUP feature bit we are adding to virtio_net.
>> I think that "backup" is also misleading. Both "active" and "backup"
>> mean a *state* of slaves. This should be named differently.
>> 
>> 
>> 
>> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > am not too happy with it.
>> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> No. The netdev could be any netdevice. It does not have to be a "VF".
>> I think "stolen" is quite appropriate since it describes the modus
>> operandi. The bypass master steals some netdevice according to some
>> match.
>> 
>> But I don't insist on "stolen". Just sounds right.
>
>We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>'backup' name is consistent.

It perhaps makes sense from the point of view of the virtio device. However, as I
have described a couple of times, for a master/slave device the name "backup" is
highly misleading.


>
>The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>
>Will look for any suggestions in the next day or two. If i don't get any, i will go
>with 'stolen'
>
><snip>
>
>
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > As I wrote the last time, you don't need this list, spinlock.
>> > > You can do just something like:
>> > >           for_each_net(net) {
>> > >                   for_each_netdev(net, dev) {
>> > > 			if (netif_is_bypass_master(dev)) {
>> > This function returns the upper netdev as well as the ops associated
>> > with that netdev.
>> > bypass_master_list is a list of 'struct bypass_master' that associates
>> Well, can't you have it in netdev priv?
>
>We cannot do this for 2-netdev model as there is no bypass_netdev created.

How come? You have no master? I don't understand.



>
>> 
>> 
>> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> > NULL for 3-netdev model.
>> I see :(
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > 
>> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> > > I don't see how rcu_dereference is ok here.
>> > > 1) I don't see rcu_read_lock taken
>> > > 2) Looks like bypass_master->ops has the same value across the whole
>> > >      existence.
>> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> > Yes. ops doesn't change.
>> If it does not change, you can just access it directly.
>> 
>> 
>> > > 
>> > > > +			spin_unlock(&bypass_lock);
>> > > > +			return bypass_netdev;
>> > > > +		}
>> > > > +	}
>> > > > +	spin_unlock(&bypass_lock);
>> > > > +	return NULL;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret, orig_mtu;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > For master, could you use word "master" in the variables so it is clear?
>> > > Also, "dev" is fine instead of "netdev".
>> > > Something like "bpmaster_dev"
>> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> I was trying to point out, that "bypass_netdev" represents a "master"
>> netdev, yet it does not say master. That is why I suggested
>> "bpmaster_dev"
>> 
>> 
>> > I can change all _netdev suffixes to _dev to make the names shorter.
>> ok.
>> 
>> 
>> > 
>> > > 
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > > > +					bypass_ops);
>> > > > +	if (ret != 0)
>> > > 	Just "if (ret)" will do. You have this on more places.
>> > OK.
>> > 
>> > 
>> > > 
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> > > > +					 bypass_handle_frame, bypass_netdev);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > > > +			   ret);
>> > > > +		goto done;
>> > > > +	}
>> > > > +
>> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > > > +			   bypass_netdev->name, ret);
>> > > > +		goto upper_link_failed;
>> > > > +	}
>> > > > +
>> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	if (netif_running(bypass_netdev)) {
>> > > > +		ret = dev_open(slave_netdev);
>> > > > +		if (ret && (ret != -EBUSY)) {
>> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > > > +				   slave_netdev->name, ret);
>> > > > +			goto err_interface_up;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	/* Align MTU of slave with master */
>> > > > +	orig_mtu = slave_netdev->mtu;
>> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> > > > +		goto err_set_mtu;
>> > > > +	}
>> > > > +
>> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto err_join;
>> > > > +
>> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	goto done;
>> > > > +
>> > > > +err_join:
>> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > > > +err_set_mtu:
>> > > > +	dev_close(slave_netdev);
>> > > > +err_interface_up:
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +upper_link_failed:
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > > > +				       struct net_device *bypass_netdev,
>> > > > +				       struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_pre_unregister)
>> > > > +			return -EINVAL;
>> > > > +
>> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > > > +							bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		return -EINVAL;
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > > > +				struct net_device *bypass_netdev,
>> > > > +				struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_release)
>> > > > +			return -EINVAL;
>> > > I think it would be good to make the API to the driver more strict and
>> > > have a separate set of ops for "active" and "backup" netdevices.
>> > > That should stop people thinking about extending this to more slaves in
>> > > the future.
>> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> > 'active' slave.
>> I'm very well aware of that. I just thought that explicit ops for the
>> two slaves would make this more clear.
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > > +
>> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev == backup_netdev) {
>> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > > > +	} else {
>> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > > > +		if (backup_netdev) {
>> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	dev_put(slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > > > +					  bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto done;
>> > > > +
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > > > +
>> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> > > > +{
>> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_link_change)
>> > > > +			goto done;
>> > > > +
>> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> > > > +						     bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	if (!netif_running(bypass_netdev))
>> > > > +		return 0;
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		goto done;
>> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> > > above is enough.
>> > I think we need this check to not allow events from a slave that is not
>> > attached to this master but has the same MAC.
>> Why do we need such events? Seems wrong to me.
>
>We want to avoid events from a netdev that is mis-configured with the same MAC as
>a bypass setup.
>
>>   Consider:
>> 
>> bp1      bp2
>> a1 b1    a2 b2
>> 
>> 
>> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>
>We should not have 2 bypass configs with the same MAC.
>I need to add a check in the bypass_master_register() to prevent this.

The MAC can change, so you would have to check on change as well. Feels odd,
though.
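
To make the two checks concrete, here is a minimal user-space sketch. It is
not kernel code and not the patch's implementation: every model_* name is
made up for illustration. It assumes a table of masters keyed only by MAC,
where the lookup returns the first match, and shows that uniqueness has to
be enforced both at registration and on a later address change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN	6
#define MAX_MASTERS	8

/* Hypothetical user-space stand-in for a registered bypass master. */
struct model_master {
	const char *name;
	unsigned char mac[ETH_ALEN];
};

static struct model_master *masters[MAX_MASTERS];
static size_t nr_masters;

/* First match wins, as in any list walk keyed only by MAC. */
static struct model_master *model_get_bymac(const unsigned char *mac)
{
	for (size_t i = 0; i < nr_masters; i++)
		if (!memcmp(masters[i]->mac, mac, ETH_ALEN))
			return masters[i];
	return NULL;
}

/* Registration refuses a MAC that another master already uses. */
static bool model_register(struct model_master *m)
{
	assert(m);
	if (nr_masters == MAX_MASTERS || model_get_bymac(m->mac))
		return false;
	masters[nr_masters++] = m;
	return true;
}

/* A later address change needs the same uniqueness check. */
static bool model_set_mac(struct model_master *m, const unsigned char *mac)
{
	struct model_master *other = model_get_bymac(mac);

	if (other && other != m)
		return false;
	memcpy(m->mac, mac, ETH_ALEN);
	return true;
}
```

Without the check in model_set_mac(), a second master could take over an
address after registration and the lookup would silently keep returning
whichever master was registered first.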


>
>The above check is to avoid cases where we have
>bp1(a1, b1) with mac1
>and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>
>> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> the order of creation.
>> Let's say it will return bp1. Then when we have event for a2, the
>> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> 
>> 
>> You cannot use bypass_master_get_bymac() here.
>> 
>> 
>> 
>> > > 
>> > > > +
>> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > > > +		netif_carrier_on(bypass_netdev);
>> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> > > > +	} else {
>> > > > +		netif_carrier_off(bypass_netdev);
>> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > > > +{
>> > > > +	/* Skip parent events */
>> > > > +	if (netif_is_bypass_master(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid non-Ethernet type devices */
>> > > > +	if (dev->type != ARPHRD_ETHER)
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > > > +	if (is_vlan_dev(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > helpers netif_is_bond_master().
>> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > 
>> > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > to whitelist VF devices. My port flavours patchset might help with this.
>> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> I don't see such function in the code.
>
>It is netdev_has_any_lower_dev(). I need to export it.

Come on, you cannot use that. That would allow bonding without slaves,
but the slaves could be added later on.

What exactly are you trying to achieve by this?
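
A small user-space sketch of the difference, under loudly stated
assumptions: the enum and both helpers are hypothetical and exist nowhere
in the kernel. A check on the current topology accepts a bond that simply
has no slaves yet, while a whitelist on the device kind does not depend on
when the slaves are added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device kinds; the kernel has no such enum. */
enum model_kind { MODEL_KIND_VF, MODEL_KIND_BOND, MODEL_KIND_VLAN };

struct model_dev {
	enum model_kind kind;
	int nr_lower;	/* lower devices attached right now */
};

/* Topology check: accepts anything that has no lower devices yet. */
static bool model_accept_if_no_lower(const struct model_dev *dev)
{
	assert(dev);
	return dev->nr_lower == 0;
}

/* Whitelist: accepts only the one kind meant to become a slave. */
static bool model_accept_whitelist(const struct model_dev *dev)
{
	assert(dev);
	return dev->kind == MODEL_KIND_VF;
}
```

An empty bond passes model_accept_if_no_lower() at registration time and
only starts failing it once slaves are attached, which is too late.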


>
>> 
>> 
>> > device is not an upper dev.
>> > Can you point to your port flavours patchset? Is it upstream?
>> I sent rfc couple of weeks ago:
>> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>
>
>

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]           ` <20180418191315.GA1922@nanopsycho>
@ 2018-04-18 19:46             ` Michael S. Tsirkin
  2018-04-18 20:32               ` Jiri Pirko
       [not found]               ` <20180418203206.GC1922@nanopsycho>
  0 siblings, 2 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2018-04-18 19:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem

On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > This provides a generic interface for paravirtual drivers to listen
> >> > > > for netdev register/unregister/link change events from pci ethernet
> >> > > > devices with the same MAC and takeover their datapath. The notifier and
> >> > > > event handling code is based on the existing netvsc implementation.
> >> > > > 
> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
> >> > > > master netdev is created. The paravirtual driver registers each bypass
> >> > > > instance along with a set of ops to manage the slave events.
> >> > > >       bypass_master_register()
> >> > > >       bypass_master_unregister()
> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
> >> > > > the bypass module provides interfaces to create/destroy additional master
> >> > > > netdev and all the slave events are managed internally.
> >> > > >        bypass_master_create()
> >> > > >        bypass_master_destroy()
> >> > > > 
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > include/linux/netdevice.h |  14 +
> >> > > > include/net/bypass.h      |  96 ++++++
> >> > > > net/Kconfig               |  18 +
> >> > > > net/core/Makefile         |   1 +
> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > 5 files changed, 973 insertions(+)
> >> > > > create mode 100644 include/net/bypass.h
> >> > > > create mode 100644 net/core/bypass.c
> >> > > > 
> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> > > > index cf44503ea81a..587293728f70 100644
> >> > > > --- a/include/linux/netdevice.h
> >> > > > +++ b/include/linux/netdevice.h
> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
> >> > > > 	IFF_MACSEC			= 1<<25,
> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
> >> > > > +	IFF_BYPASS			= 1 << 27,
> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
> >> > > I wonder, why you don't follow the existing coding style... Also, please
> >> > > add these to into the comment above.
> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
> >> > to the existing coding style to be consistent.
> >> Please do.
> >> 
> >> 
> >> > > 
> >> > > > };
> >> > > > 
> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> >> > > > #define IFF_MACSEC			IFF_MACSEC
> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
> >> > > > +#define IFF_BYPASS			IFF_BYPASS
> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> >> > > > 
> >> > > > /**
> >> > > >    *	struct net_device - The DEVICE structure.
> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> >> > > > }
> >> > > > 
> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS;
> >> > > > +}
> >> > > > +
> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
> >> > > > +}
> >> > > > +
> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> >> > > > static inline void netif_keep_dst(struct net_device *dev)
> >> > > > {
> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
> >> > > > new file mode 100644
> >> > > > index 000000000000..86b02cb894cf
> >> > > > --- /dev/null
> >> > > > +++ b/include/net/bypass.h
> >> > > > @@ -0,0 +1,96 @@
> >> > > > +// SPDX-License-Identifier: GPL-2.0
> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
> >> > > > +
> >> > > > +#ifndef _NET_BYPASS_H
> >> > > > +#define _NET_BYPASS_H
> >> > > > +
> >> > > > +#include <linux/netdevice.h>
> >> > > > +
> >> > > > +struct bypass_ops {
> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
> >> > > > +				  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
> >> > > > +			  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
> >> > > > +				    struct net_device *bypass_netdev);
> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
> >> > > > +			     struct net_device *bypass_netdev);
> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
> >> > > > +				 struct net_device *bypass_netdev);
> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> >> > > > +};
> >> > > > +
> >> > > > +struct bypass_master {
> >> > > > +	struct list_head list;
> >> > > > +	struct net_device __rcu *bypass_netdev;
> >> > > > +	struct bypass_ops __rcu *ops;
> >> > > > +};
> >> > > > +
> >> > > > +/* bypass state */
> >> > > > +struct bypass_info {
> >> > > > +	/* passthru netdev with same MAC */
> >> > > > +	struct net_device __rcu *active_netdev;
> >> > > You still use "active"/"backup" names which is highly misleading as
> >> > > it has completely different meaning that in bond for example.
> >> > > I noted that in my previous review already. Please change it.
> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
> >> > matches with the BACKUP feature bit we are adding to virtio_net.
> >> I think that "backup" is also misleading. Both "active" and "backup"
> >> mean a *state* of slaves. This should be named differently.
> >> 
> >> 
> >> 
> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> > am not too happy with it.
> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> I think "stolen" is quite appropriate since it describes the modus
> >> operandi. The bypass master steals some netdevice according to some
> >> match.
> >> 
> >> But I don't insist on "stolen". Just sounds right.
> >
> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >'backup' name is consistent.
> 
> It perhaps makes sense from the view of virtio device. However, as I
> described couple of times, for master/slave device the name "backup" is
> highly misleading.

virtio is the backup. You are supposed to use another
(typically passthrough) device and, if that fails, fall back to virtio.
So the name does seem appropriate to me. If you like, we can
change it to "standby".  I don't like "active" either. "main"?

In fact, would "failover" be a better name than "bypass"?
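
As a rough user-space sketch of that behaviour (not the patch's code; the
model_* names and the "primary"/"standby" wording are placeholders, since
the naming is still under discussion): prefer the passthrough device while
it is running with carrier, otherwise fall back to virtio. The readiness
test mirrors bypass_xmit_ready() from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the link state of one slave netdev. */
struct model_link {
	bool running;	/* models netif_running() */
	bool carrier;	/* models netif_carrier_ok() */
};

/* Same condition as bypass_xmit_ready(), but tolerant of a missing slave. */
static bool model_xmit_ready(const struct model_link *l)
{
	return l && l->running && l->carrier;
}

/*
 * Prefer the primary (typically passthrough) device; if it is not
 * usable, fall back to the standby (virtio) device.
 */
static const struct model_link *
model_pick_xmit(const struct model_link *primary,
		const struct model_link *standby)
{
	if (model_xmit_ready(primary))
		return primary;
	if (model_xmit_ready(standby))
		return standby;
	return NULL;	/* nothing usable: master carrier would be off */
}
```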


> 
> >
> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
> >
> >Will look for any suggestions in the next day or two. If i don't get any, i will go
> >with 'stolen'
> >
> ><snip>
> >
> >
> >> +
> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> >> +						  struct bypass_ops **ops)
> >> +{
> >> +	struct bypass_master *bypass_master;
> >> +	struct net_device *bypass_netdev;
> >> +
> >> +	spin_lock(&bypass_lock);
> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> >> > > As I wrote the last time, you don't need this list, spinlock.
> >> > > You can do just something like:
> >> > >           for_each_net(net) {
> >> > >                   for_each_netdev(net, dev) {
> >> > > 			if (netif_is_bypass_master(dev)) {
> >> > This function returns the upper netdev as well as the ops associated
> >> > with that netdev.
> >> > bypass_master_list is a list of 'struct bypass_master' that associates
> >> Well, can't you have it in netdev priv?
> >
> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
> 
> Howcome? You have no master? I don't understand..
> 
> 
> 
> >
> >> 
> >> 
> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
> >> > NULL for 3-netdev model.
> >> I see :(
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > 
> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
> >> > > I don't see how rcu_dereference is ok here.
> >> > > 1) I don't see rcu_read_lock taken
> >> > > 2) Looks like bypass_master->ops has the same value across the whole
> >> > >      existence.
> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
> >> > Yes. ops doesn't change.
> >> If it does not change, you can just access it directly.
> >> 
> >> 
> >> > > 
> >> > > > +			spin_unlock(&bypass_lock);
> >> > > > +			return bypass_netdev;
> >> > > > +		}
> >> > > > +	}
> >> > > > +	spin_unlock(&bypass_lock);
> >> > > > +	return NULL;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret, orig_mtu;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > For master, could you use word "master" in the variables so it is clear?
> >> > > Also, "dev" is fine instead of "netdev".
> >> > > Something like "bpmaster_dev"
> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> >> I was trying to point out, that "bypass_netdev" represents a "master"
> >> netdev, yet it does not say master. That is why I suggested
> >> "bpmaster_dev"
> >> 
> >> 
> >> > I can change all _netdev suffixes to _dev to make the names shorter.
> >> ok.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
> >> > > > +					bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > 	Just "if (ret)" will do. You have this on more places.
> >> > OK.
> >> > 
> >> > 
> >> > > 
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
> >> > > > +					 bypass_handle_frame, bypass_netdev);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
> >> > > > +			   ret);
> >> > > > +		goto done;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
> >> > > > +			   bypass_netdev->name, ret);
> >> > > > +		goto upper_link_failed;
> >> > > > +	}
> >> > > > +
> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	if (netif_running(bypass_netdev)) {
> >> > > > +		ret = dev_open(slave_netdev);
> >> > > > +		if (ret && (ret != -EBUSY)) {
> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
> >> > > > +				   slave_netdev->name, ret);
> >> > > > +			goto err_interface_up;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	/* Align MTU of slave with master */
> >> > > > +	orig_mtu = slave_netdev->mtu;
> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
> >> > > > +		goto err_set_mtu;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto err_join;
> >> > > > +
> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	goto done;
> >> > > > +
> >> > > > +err_join:
> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
> >> > > > +err_set_mtu:
> >> > > > +	dev_close(slave_netdev);
> >> > > > +err_interface_up:
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +upper_link_failed:
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
> >> > > > +				       struct net_device *bypass_netdev,
> >> > > > +				       struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_pre_unregister)
> >> > > > +			return -EINVAL;
> >> > > > +
> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
> >> > > > +							bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		return -EINVAL;
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
> >> > > > +				struct net_device *bypass_netdev,
> >> > > > +				struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_release)
> >> > > > +			return -EINVAL;
> >> > > I think it would be good to make the API to the driver more strict and
> >> > > have a separate set of ops for "active" and "backup" netdevices.
> >> > > That should stop people thinking about extending this to more slaves in
> >> > > the future.
> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
> >> > 'active' slave.
> >> I'm very well aware of that. I just thought that explicit ops for the
> >> two slaves would make this more clear.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > > +
> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev == backup_netdev) {
> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
> >> > > > +	} else {
> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
> >> > > > +		if (backup_netdev) {
> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	dev_put(slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
> >> > > > +					  bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
> >> > > > +
> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
> >> > > > +{
> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_link_change)
> >> > > > +			goto done;
> >> > > > +
> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
> >> > > > +						     bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	if (!netif_running(bypass_netdev))
> >> > > > +		return 0;
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		goto done;
> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> >> > > above is enough.
> >> > I think we need this check to not allow events from a slave that is not
> >> > attached to this master but has the same MAC.
> >> Why do we need such events? Seems wrong to me.
> >
> >We want to avoid events from a netdev that is mis-configured with the same MAC as
> >a bypass setup.
> >
> >>   Consider:
> >> 
> >> bp1      bp2
> >> a1 b1    a2 b2
> >> 
> >> 
> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
> >
> >We should not have 2 bypass configs with the same MAC.
> >I need to add a check in the bypass_master_register() to prevent this.
> 
> The MAC can change, so you would have to check on change as well. Feels odd,
> though.
> 
> 
> >
> >The above check is to avoid cases where we have
> >bp1(a1, b1) with mac1
> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> >
> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> >> the order of creation.
> >> Let's say it will return bp1. Then when we have event for a2, the
> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
> >> 
> >> 
> >> You cannot use bypass_master_get_bymac() here.
> >> 
> >> 
> >> 
> >> > > 
> >> > > > +
> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
> >> > > > +		netif_carrier_on(bypass_netdev);
> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
> >> > > > +	} else {
> >> > > > +		netif_carrier_off(bypass_netdev);
> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
> >> > > > +{
> >> > > > +	/* Skip parent events */
> >> > > > +	if (netif_is_bypass_master(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid non-Ethernet type devices */
> >> > > > +	if (dev->type != ARPHRD_ETHER)
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
> >> > > > +	if (is_vlan_dev(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
> >> > > helpers netif_is_bond_master().
> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
> >> > > 
> >> > > You need to do it not by blacklisting, but with whitelisting. You need
> >> > > to whitelist VF devices. My port flavours patchset might help with this.
> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
> >> I don't see such function in the code.
> >
> >It is netdev_has_any_lower_dev(). I need to export it.
> 
> Come on, you cannot use that. That would allow bonding without slaves,
> but the slaves could be added later on.
> 
> What exactly are you trying to achieve by this?
> 
> 
> >
> >> 
> >> 
> >> > device is not an upper dev.
> >> > Can you point to your port flavours patchset? Is it upstream?
> >> I sent rfc couple of weeks ago:
> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
> >
> >
> >

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 19:46             ` Michael S. Tsirkin
@ 2018-04-18 20:32               ` Jiri Pirko
       [not found]               ` <20180418203206.GC1922@nanopsycho>
  1 sibling, 0 replies; 47+ messages in thread
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem

Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > > 
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > >       bypass_master_register()
>> >> > > >       bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > >        bypass_master_create()
>> >> > > >        bypass_master_destroy()
>> >> > > > 
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h |  14 +
>> >> > > > include/net/bypass.h      |  96 ++++++
>> >> > > > net/Kconfig               |  18 +
>> >> > > > net/core/Makefile         |   1 +
>> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > > 
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> >> > > > 	IFF_MACSEC			= 1<<25,
>> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> >> > > > +	IFF_BYPASS			= 1 << 27,
>> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >> 
>> >> 
>> >> > > 
>> >> > > > };
>> >> > > > 
>> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC			IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS			IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> >> > > > 
>> >> > > > /**
>> >> > > >    *	struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > > 
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > +				  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > +			  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > +				    struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > +			     struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > +				 struct net_device *bypass_netdev);
>> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > +	struct list_head list;
>> >> > > > +	struct net_device __rcu *bypass_netdev;
>> >> > > > +	struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > +	/* passthru netdev with same MAC */
>> >> > > > +	struct net_device __rcu *active_netdev;
>> >> > > > You still use "active"/"backup" names which is highly misleading as
>> >> > > > it has a completely different meaning than in bond, for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >> 
>> >> 
>> >> 
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >> 
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>> 
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby".  Active I don't like either. "main"?

Sounds much better, yes.


>
>In fact would failover be better than bypass?

Also, much better.


>
>
>> 
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> +						  struct bypass_ops **ops)
>> >> +{
>> >> +	struct bypass_master *bypass_master;
>> >> +	struct net_device *bypass_netdev;
>> >> +
>> >> +	spin_lock(&bypass_lock);
>> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > >           for_each_net(net) {
>> >> > >                   for_each_netdev(net, dev) {
>> >> > > 			if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> 
>> Howcome? You have no master? I don't understand..
>> 
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > 
>> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > >      existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >> 
>> >> 
>> >> > > 
>> >> > > > +			spin_unlock(&bypass_lock);
>> >> > > > +			return bypass_netdev;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +	spin_unlock(&bypass_lock);
>> >> > > > +	return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret, orig_mtu;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >> 
>> >> 
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > +					bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > 	Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> > 
>> >> > 
>> >> > > 
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> >> > > > +					 bypass_handle_frame, bypass_netdev);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > +			   ret);
>> >> > > > +		goto done;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > +			   bypass_netdev->name, ret);
>> >> > > > +		goto upper_link_failed;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	if (netif_running(bypass_netdev)) {
>> >> > > > +		ret = dev_open(slave_netdev);
>> >> > > > +		if (ret && (ret != -EBUSY)) {
>> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > +				   slave_netdev->name, ret);
>> >> > > > +			goto err_interface_up;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	/* Align MTU of slave with master */
>> >> > > > +	orig_mtu = slave_netdev->mtu;
>> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> >> > > > +		goto err_set_mtu;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto err_join;
>> >> > > > +
>> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > +	dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > +				       struct net_device *bypass_netdev,
>> >> > > > +				       struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_pre_unregister)
>> >> > > > +			return -EINVAL;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > +							bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		return -EINVAL;
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > +				struct net_device *bypass_netdev,
>> >> > > > +				struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_release)
>> >> > > > +			return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev == backup_netdev) {
>> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > +	} else {
>> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > +		if (backup_netdev) {
>> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	dev_put(slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > +					  bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_link_change)
>> >> > > > +			goto done;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > +						     bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	if (!netif_running(bypass_netdev))
>> >> > > > +		return 0;
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >>   Consider:
>> >> 
>> >> bp1      bp2
>> >> a1 b1    a2 b2
>> >> 
>> >> 
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>> 
>> The MAC can change, so you would have to check on change as well. Feels odd
>> though.
>> 
>> 
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >> 
>> >> 
>> >> You cannot use bypass_master_get_bymac() here.
>> >> 
>> >> 
>> >> 
>> >> > > 
>> >> > > > +
>> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > +		netif_carrier_on(bypass_netdev);
>> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > +	} else {
>> >> > > > +		netif_carrier_off(bypass_netdev);
>> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	/* Skip parent events */
>> >> > > > +	if (netif_is_bypass_master(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid non-Ethernet type devices */
>> >> > > > +	if (dev->type != ARPHRD_ETHER)
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > +	if (is_vlan_dev(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > > 
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>> 
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>> 
>> What exactly you are trying to achieve by this?
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >
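
To make the bp1/bp2 scenario quoted above concrete, here is a minimal
user-space C sketch (not kernel code). struct model_dev, master_get_bylink()
and mac_lookup_mismatch() are hypothetical names invented for illustration;
only the first-match-by-MAC behaviour is taken from bypass_master_get_bymac()
in the patch. It assumes two masters sharing one MAC and a slave a2 that was
linked to bp2 when it joined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical user-space stand-in for a netdev; not the kernel struct. */
struct model_dev {
	const char *name;
	unsigned char mac[MAC_LEN];
	struct model_dev *upper;	/* master recorded when the slave joined */
};

/* First match by MAC, modelled on bypass_master_get_bymac() in the patch. */
static struct model_dev *master_get_bymac(struct model_dev **masters, size_t n,
					  const unsigned char *mac)
{
	size_t i;

	assert(masters != NULL && mac != NULL);
	for (i = 0; i < n; i++) {
		if (memcmp(masters[i]->mac, mac, MAC_LEN) == 0)
			return masters[i];
	}
	return NULL;
}

/* Hypothetical alternative: follow the link stored at registration time. */
static struct model_dev *master_get_bylink(const struct model_dev *slave)
{
	return slave->upper;
}

/* Returns 1 if the MAC lookup hands a2's event to the wrong master. */
int mac_lookup_mismatch(void)
{
	struct model_dev bp1 = { "bp1", { 0x02, 0, 0, 0, 0, 0x01 }, NULL };
	struct model_dev bp2 = { "bp2", { 0x02, 0, 0, 0, 0, 0x01 }, NULL };
	struct model_dev a2 = { "a2", { 0x02, 0, 0, 0, 0, 0x01 }, &bp2 };
	struct model_dev *masters[] = { &bp1, &bp2 };	/* creation order */

	return master_get_bymac(masters, 2, a2.mac) != master_get_bylink(&a2);
}
```

In this model the MAC lookup returns bp1 for an event on a2, while the
recorded link points at bp2, which is the mismatch described in the review.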

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]               ` <20180418203206.GC1922@nanopsycho>
@ 2018-04-18 22:46                 ` Samudrala, Sridhar
  2018-04-19  4:08                 ` Michael S. Tsirkin
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 22:46 UTC (permalink / raw)
  To: Jiri Pirko, Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kubakici, netdev, virtualization,
	loseweigh, davem

On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>>>>>>> You still use "active"/"backup" names which is highly misleading as
>>>>>>> it has a completely different meaning than in bond, for example.
>>>>>>> I noted that in my previous review already. Please change it.
>>>>>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>>>>>> matches with the BACKUP feature bit we are adding to virtio_net.
>>>>> I think that "backup" is also misleading. Both "active" and "backup"
>>>>> mean a *state* of slaves. This should be named differently.
>>>>>
>>>>>
>>>>>
>>>>>> With regards to alternate names for 'active', you suggested 'stolen', but i
>>>>>> am not too happy with it.
>>>>>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>>>>> No. The netdev could be any netdevice. It does not have to be a "VF".
>>>>> I think "stolen" is quite appropriate since it describes the modus
>>>>> operandi. The bypass master steals some netdevice according to some
>>>>> match.
>>>>>
>>>>> But I don't insist on "stolen". Just sounds right.
>>>> We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>>>> 'backup' name is consistent.
>>> It perhaps makes sense from the view of virtio device. However, as I
>>> described couple of times, for master/slave device the name "backup" is
>>> highly misleading.
>> virtio is the backup. You are supposed to use another
>> (typically passthrough) device, if that fails use virtio.
>> It does seem appropriate to me. If you like, we can
>> change that to "standby".  Active I don't like either. "main"?
> Sounds much better, yes.

OK. Will change backup to 'standby'.
'main' is fine, what about 'primary'?


>
>
>> In fact would failover be better than bypass?
> Also, much better.

So do we want to change all 'bypass' references to 'failover', including
the filenames (net/core/failover.c and include/net/failover.h)?

<snip>



>
>
>>
>>>> The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>>>> a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>>>>
>>>> Will look for any suggestions in the next day or two. If i don't get any, i will go
>>>> with 'stolen'
>>>>
>>>> <snip>
>>>>
>>>>
>>>>> +
>>>>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>>>>> +						  struct bypass_ops **ops)
>>>>> +{
>>>>> +	struct bypass_master *bypass_master;
>>>>> +	struct net_device *bypass_netdev;
>>>>> +
>>>>> +	spin_lock(&bypass_lock);
>>>>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>>>>>> As I wrote the last time, you don't need this list, spinlock.
>>>>>>> You can do just something like:
>>>>>>>            for_each_net(net) {
>>>>>>>                    for_each_netdev(net, dev) {
>>>>>>> 			if (netif_is_bypass_master(dev)) {
>>>>>> This function returns the upper netdev as well as the ops associated
>>>>>> with that netdev.
>>>>>> bypass_master_list is a list of 'struct bypass_master' that associates
>>>>> Well, can't you have it in netdev priv?
>>>> We cannot do this for 2-netdev model as there is no bypass_netdev created.
>>> Howcome? You have no master? I don't understand..

For the 2-netdev model, the master netdev is not a new one created by the bypass module.
It is created by netvsc internally and passed in via bypass_master_register().
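
A minimal user-space sketch of how the two models differ at dispatch time,
following the pattern in the patch's bypass_slave_release(): a driver-supplied
ops table is used in the 2-netdev model, and built-in handling is used when
ops is NULL (3-netdev model). struct model_dev, struct model_ops,
default_slave_release() and example_driver_release() are hypothetical names
for illustration, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a netdev. */
struct model_dev {
	const char *name;
};

/* One callback out of the patch's struct bypass_ops, for brevity. */
struct model_ops {
	int (*slave_release)(struct model_dev *slave, struct model_dev *master);
};

/* Counts calls into the example driver callback below. */
int driver_release_calls;

/* Example of a callback a 2-netdev driver might register. */
int example_driver_release(struct model_dev *slave, struct model_dev *master)
{
	(void)slave;
	(void)master;
	driver_release_calls++;
	return 0;
}

/* Built-in handling, used when no ops table was registered. */
static int default_slave_release(struct model_dev *slave,
				 struct model_dev *master)
{
	(void)slave;
	(void)master;
	return 0;
}

/* Dispatch: driver ops if present (2-netdev), otherwise built-in (3-netdev). */
int slave_release(struct model_dev *slave, struct model_dev *master,
		  const struct model_ops *ops)
{
	assert(slave != NULL && master != NULL);

	if (ops) {
		if (!ops->slave_release)
			return -EINVAL;	/* ops table present but incomplete */
		return ops->slave_release(slave, master);
	}
	return default_slave_release(slave, master);
}
```

Passing NULL for ops takes the built-in path; passing a table with a missing
callback is rejected with -EINVAL, as in the quoted patch.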

<snip>



>>>
>>>>>>>> +
>>>>>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>>>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>>>>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>>>>>> helpers netif_is_bond_master().
>>>>>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>>>>>
>>>>>>> You need to do it not by blacklisting, but with whitelisting. You need
>>>>>>> to whitelist VF devices. My port flavours patchset might help with this.
>>>>>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
>>>>> I don't see such function in the code.
>>>> It is netdev_has_any_lower_dev(). I need to export it.
>>> Come on, you cannot use that. That would allow bonding without slaves,
>>> but the slaves could be added later on.
>>>
>>> What exactly you are trying to achieve by this?

I think I can remove this check. In pre-register, for the backup device,
I check that its parent matches the bypass device, and for the VF device,
we make sure that it is a PCI device.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]               ` <20180418203206.GC1922@nanopsycho>
  2018-04-18 22:46                 ` Samudrala, Sridhar
@ 2018-04-19  4:08                 ` Michael S. Tsirkin
       [not found]                 ` <ff0c5ea1-16d4-00a7-9952-9049efa818eb@intel.com>
       [not found]                 ` <20180419070752-mutt-send-email-mst@kernel.org>
  3 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2018-04-19  4:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem

On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> >> > am not too happy with it.
> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> >> I think "stolen" is quite appropriate since it describes the modus
> >> >> operandi. The bypass master steals some netdevice according to some
> >> >> match.
> >> >> 
> >> >> But I don't insist on "stolen". Just sounds right.
> >> >
> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >> >'backup' name is consistent.
> >> 
> >> It perhaps makes sense from the view of virtio device. However, as I
> >> described couple of times, for master/slave device the name "backup" is
> >> highly misleading.
> >
> >virtio is the backup. You are supposed to use another
> >(typically passthrough) device, if that fails use virtio.
> >It does seem appropriate to me. If you like, we can
> >change that to "standby".  Active I don't like either. "main"?
> 
> Sounds much better, yes.

Excuse me, which of the versions is better in your eyes?


> 
> >
> >In fact would failover be better than bypass?
> 
> Also, much better.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]                 ` <ff0c5ea1-16d4-00a7-9952-9049efa818eb@intel.com>
@ 2018-04-19  6:35                   ` Jiri Pirko
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Pirko @ 2018-04-19  6:35 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, Michael S. Tsirkin, kubakici,
	netdev, virtualization, loseweigh, davem

Thu, Apr 19, 2018 at 12:46:11AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>> > > > > > > You still use "active"/"backup" names which is highly misleading as
>> > > > > > > it has a completely different meaning than in bond, for example.
>> > > > > > > I noted that in my previous review already. Please change it.
>> > > > > > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > > > > > matches with the BACKUP feature bit we are adding to virtio_net.
>> > > > > I think that "backup" is also misleading. Both "active" and "backup"
>> > > > > mean a *state* of slaves. This should be named differently.
>> > > > > 
>> > > > > 
>> > > > > 
>> > > > > > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > > > > > am not too happy with it.
>> > > > > > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> > > > > No. The netdev could be any netdevice. It does not have to be a "VF".
>> > > > > I think "stolen" is quite appropriate since it describes the modus
>> > > > > operandi. The bypass master steals some netdevice according to some
>> > > > > match.
>> > > > > 
>> > > > > But I don't insist on "stolen". Just sounds right.
>> > > > We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> > > > 'backup' name is consistent.
>> > > It perhaps makes sense from the view of virtio device. However, as I
>> > > described couple of times, for master/slave device the name "backup" is
>> > > highly misleading.
>> > virtio is the backup. You are supposed to use another
>> > (typically passthrough) device, if that fails use virtio.
>> > It does seem appropriate to me. If you like, we can
>> > change that to "standby".  Active I don't like either. "main"?
>> Sounds much better, yes.
>
>OK. Will change backup to 'standby'.
>'main' is fine, what about 'primary'?

Primary is also bonding terminology. But in this case, I think it would
fit. The primary slave is used as the active one whenever the link is
up.
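
A minimal user-space sketch of that selection rule, combined with the carrier
logic of bypass_slave_link_change() from the patch. It assumes the proposed
"primary"/"standby" naming, which is still under discussion in this thread;
struct model_slave, select_xmit_slave() and master_carrier_ok() are
hypothetical names, and the two bool fields stand in for netif_running() and
netif_carrier_ok().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a slave netdev's link state. */
struct model_slave {
	bool running;	/* stands in for netif_running() */
	bool carrier;	/* stands in for netif_carrier_ok() */
};

/* Mirrors bypass_xmit_ready() from the patch, plus a NULL check. */
static bool xmit_ready(const struct model_slave *s)
{
	return s != NULL && s->running && s->carrier;
}

/* Prefer the primary whenever it can transmit; otherwise use the standby. */
const struct model_slave *select_xmit_slave(const struct model_slave *primary,
					    const struct model_slave *standby)
{
	/* One device cannot fill both roles. */
	assert(primary != standby || primary == NULL);

	if (xmit_ready(primary))
		return primary;
	if (xmit_ready(standby))
		return standby;
	return NULL;
}

/* The master reports carrier as long as either slave can transmit. */
bool master_carrier_ok(const struct model_slave *primary,
		       const struct model_slave *standby)
{
	return xmit_ready(primary) || xmit_ready(standby);
}
```

With both slaves up the primary is chosen; once its carrier drops, traffic
moves to the standby, and the master only loses carrier when neither slave
is ready.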


>
>
>> 
>> 
>> > In fact would failover be better than bypass?
>> Also, much better.
>
>So do we want to change all 'bypass' references to 'failover', including
>the filenames (net/core/failover.c and include/net/failover.h)?
>
><snip>
>
>
>
>> 
>> 
>> > 
>> > > > The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> > > > a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> > > > 
>> > > > Will look for any suggestions in the next day or two. If i don't get any, i will go
>> > > > with 'stolen'
>> > > > 
>> > > > <snip>
>> > > > 
>> > > > 
>> > > > > +
>> > > > > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > > > > +						  struct bypass_ops **ops)
>> > > > > +{
>> > > > > +	struct bypass_master *bypass_master;
>> > > > > +	struct net_device *bypass_netdev;
>> > > > > +
>> > > > > +	spin_lock(&bypass_lock);
>> > > > > +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > > > > > As I wrote the last time, you don't need this list, spinlock.
>> > > > > > > You can do just something like:
>> > > > > > >            for_each_net(net) {
>> > > > > > >                    for_each_netdev(net, dev) {
>> > > > > > > 			if (netif_is_bypass_master(dev)) {
>> > > > > > This function returns the upper netdev as well as the ops associated
>> > > > > > with that netdev.
>> > > > > > bypass_master_list is a list of 'struct bypass_master' that associates
>> > > > > Well, can't you have it in netdev priv?
>> > > > We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> > > Howcome? You have no master? I don't understand..
>
>For the 2-netdev model, the master netdev is not a new one created by the bypass module.
>It is created by netvsc internally and passed in via bypass_master_register().

But virtio_net also has to create the master and pass it down to the
bypass module. How come it is different?


>
><snip>
>
>
>
>> > > 
>> > > > > > > > +
>> > > > > > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > > > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > > > > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > > > > > helpers netif_is_bond_master().
>> > > > > > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > > > > > 
>> > > > > > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > > > > > to whitelist VF devices. My port flavours patchset might help with this.
>> > > > > > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> > > > > I don't see such function in the code.
>> > > > It is netdev_has_any_lower_dev(). I need to export it.
>> > > Come on, you cannot use that. That would allow bonding without slaves,
>> > > but the slaves could be added later on.
>> > > 
>> > > What exactly you are trying to achieve by this?
>
>I think i can remove this check.  In pre-register,
>for backup device, i check that its parent matches bypass device &
>for vf device, we make sure that it is a pci device.

Okay. That is a start.
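
A minimal sketch of the whitelist-style pre-register checks described above,
written as a userspace model. All names here (model_device, model_netdev,
model_backup_ok(), model_vf_ok()) are hypothetical; the real checks would be
made against the netdev's parent device in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_bus { MODEL_BUS_PCI, MODEL_BUS_VIRTIO };

/* Models the bus device a netdev sits on. */
struct model_device {
	enum model_bus bus;
};

struct model_netdev {
	/* NULL for purely virtual netdevs such as bond, team or bridge */
	const struct model_device *parent;
};

/* A backup slave is accepted only if it shares the master's parent device. */
static bool model_backup_ok(const struct model_netdev *slave,
			    const struct model_device *master_parent)
{
	return slave->parent != NULL && slave->parent == master_parent;
}

/* A VF slave is accepted only if it is backed by a PCI device. */
static bool model_vf_ok(const struct model_netdev *slave)
{
	return slave->parent != NULL && slave->parent->bus == MODEL_BUS_PCI;
}
```

Because a bond, team or bridge master has no parent bus device in this model,
it fails both checks without having to be named explicitly, which is the point
of whitelisting over blacklisting.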


>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
       [not found]                 ` <20180419070752-mutt-send-email-mst@kernel.org>
@ 2018-04-19  7:22                   ` Jiri Pirko
  0 siblings, 0 replies; 47+ messages in thread
From: Jiri Pirko @ 2018-04-19  7:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem

Thu, Apr 19, 2018 at 06:08:58AM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
>> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> >> > am not too happy with it.
>> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> >> I think "stolen" is quite appropriate since it describes the modus
>> >> >> operandi. The bypass master steals some netdevice according to some
>> >> >> match.
>> >> >> 
>> >> >> But I don't insist on "stolen". Just sounds right.
>> >> >
>> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >> >'backup' name is consistent.
>> >> 
>> >> It perhaps makes sense from the view of virtio device. However, as I
>> >> described couple of times, for master/slave device the name "backup" is
>> >> highly misleading.
>> >
>> >virtio is the backup. You are supposed to use another
>> >(typically passthrough) device, if that fails use virtio.
>> >It does seem appropriate to me. If you like, we can
>> >change that to "standby".  Active I don't like either. "main"?
>> 
>> Sounds much better, yes.
>
>Excuse me, which of the versions are better in your eyes?

standby is okay. main/primary is fine too.

>
>
>> 
>> >
>> >In fact would failover be better than bypass?
>> 
>> Also, much better.
>> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]     ` <20180411075334.GK2028@nanopsycho>
@ 2019-02-22  1:14       ` Siwei Liu
       [not found]       ` <CADGSJ214RJV_zWVBGv0Ydo=CJj6WESTYAH=PpaYLFHdtWVrm3g@mail.gmail.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Siwei Liu @ 2019-02-22  1:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
	Sridhar Samudrala, virtualization, liran.alon, Netdev, si-wei liu,
	David Miller

Sorry for replying to this ancient thread. There is a remaining
issue that I don't think the initial net_failover patch addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically written for such automatic in-kernel enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* the virtual device gets created. That is what udev can
control (without the netdev being opened early by another part of the
kernel), and other userspace components, e.g. initramfs and
init-scripts, can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention and
provides no solution for users who care about consistent naming
of the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev model were proposed, but no one has offered a solution to
this problem since. Please share your thoughts on how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.
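
To make the failure mode concrete, here is a small userspace model of the
ordering problem. It assumes, as the kernel did at the time of this thread,
that a rename request is refused with -EBUSY while the interface is up. The
struct and the model_* helpers are hypothetical names used only for this
sketch.

```c
#include <errno.h>
#include <string.h>

#define MODEL_IFNAMSIZ 16
#define MODEL_IFF_UP 0x1u

struct model_netdev {
	char name[MODEL_IFNAMSIZ];
	unsigned int flags;
};

/* Models the rename path: refused while the device is up. */
static int model_rename(struct model_netdev *dev, const char *newname)
{
	if (strlen(newname) >= MODEL_IFNAMSIZ)
		return -EINVAL;
	if (dev->flags & MODEL_IFF_UP)
		return -EBUSY;
	strcpy(dev->name, newname);
	return 0;
}

/* Models in-kernel auto-enslavement: the failover master opens the slave. */
static void model_failover_enslave(struct model_netdev *slave)
{
	slave->flags |= MODEL_IFF_UP;
}
```

With a bond or team, userspace performs the rename first and enslaves
afterwards, so model_rename() sees the device down. With net_failover the
enslavement can happen before udev has processed the device, and the same
rename then returns -EBUSY, leaving the kernel-assigned name in place.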

On Wed, Apr 11, 2018 at 12:53 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
> >On Tue, 10 Apr 2018 11:59:50 -0700
> >Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >
> >> Use the registration/notification framework supported by the generic
> >> bypass infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> ---
> >
> >Thanks for doing this.  Your current version has couple show stopper
> >issues.
> >
> >First, the slave device is instantly taking over the slave.
> >This doesn't allow udev/systemd to do its device rename of the slave
> >device. Netvsc uses a delayed work to workaround this.
>
> Wait. Why the fact a device is enslaved has to affect the udev in any
> way? If it does, smells like a bug in udev.

See above for clarifications.

Thanks,


>
>
> >
> >Secondly, the select queue needs to call queue selection in VF.
> >The bonding/teaming logic doesn't work well for UDP flows.
> >Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> >fixed this performance problem.
> >
> >Lastly, more indirection is bad in current climate.
> >
> >I am not completely adverse to this but it needs to be fast, simple
> >and completely transparent.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]       ` <CADGSJ214RJV_zWVBGv0Ydo=CJj6WESTYAH=PpaYLFHdtWVrm3g@mail.gmail.com>
@ 2019-02-22  1:39         ` Michael S. Tsirkin
       [not found]         ` <20190221203808-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22  1:39 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, liran.alon, Netdev, si-wei liu,
	David Miller

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> Sorry for replying to this ancient thread. There was some remaining
> issue that I don't think the initial net_failover patch got addressed
> cleanly, see:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> 
> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> not specifically written for such kernel automatic enslavement.
> Specifically, if it is a bond or team, the slave would typically get
> renamed *before* virtual device gets created, that's what udev can
> control (without getting netdev opened early by the other part of
> kernel) and other userspace components for e.g. initramfs,
> init-scripts can coordinate well in between. The in-kernel
> auto-enslavement of net_failover breaks this userspace convention,
> which don't provides a solution if user care about consistent naming
> on the slave netdevs specifically.
> 
> Previously this issue had been specifically called out when IFF_HIDDEN
> and the 1-netdev was proposed, but no one gives out a solution to this
> problem ever since. Please share your mind how to proceed and solve
> this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

	there's no motivation in the systemd/udevd community at
	this point to refactor the rename logic and make it work well with
	3-netdev.

What would the fix be? Skip slave devices?

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]           ` <581e4399-3969-aecd-e923-03bbc0880733@oracle.com>
@ 2019-02-22  7:00             ` Samudrala, Sridhar
       [not found]             ` <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Samudrala, Sridhar @ 2019-02-22  7:00 UTC (permalink / raw)
  To: si-wei liu, Michael S. Tsirkin, Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
	virtualization, liran.alon, David Miller


On 2/21/2019 7:33 PM, si-wei liu wrote:
>
>
> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>> Sorry for replying to this ancient thread. There was some remaining
>>> issue that I don't think the initial net_failover patch got addressed
>>> cleanly, see:
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>
>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>> not specifically written for such kernel automatic enslavement.
>>> Specifically, if it is a bond or team, the slave would typically get
>>> renamed *before* virtual device gets created, that's what udev can
>>> control (without getting netdev opened early by the other part of
>>> kernel) and other userspace components for e.g. initramfs,
>>> init-scripts can coordinate well in between. The in-kernel
>>> auto-enslavement of net_failover breaks this userspace convention,
>>> which don't provides a solution if user care about consistent naming
>>> on the slave netdevs specifically.
>>>
>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>> problem ever since. Please share your mind how to proceed and solve
>>> this userspace issue if netdev does not welcome a 1-netdev model.
>> Above says:
>>
>>     there's no motivation in the systemd/udevd community at
>>     this point to refactor the rename logic and make it work well with
>>     3-netdev.
>>
>> What would the fix be? Skip slave devices?
>>
> There's nothing user can get if just skipping slave devices - the name 
> is still unchanged and unpredictable e.g. eth0, or eth1 the next 
> reboot, while the rest may conform to the naming scheme (ens3 and 
> such). There's no way one can fix this in userspace alone - when the 
> failover is created the enslaved netdev was opened by the kernel 
> earlier than the userspace is made aware of, and there's no 
> negotiation protocol for kernel to know when userspace has done 
> initial renaming of the interface. I would expect netdev list should 
> at least provide the direction in general for how this can be solved...
>
Is there an issue if slave device names are not predictable? The
user/admin scripts are expected to work only with the master failover
device.
Moreover, you were suggesting hiding the lower slave devices anyway.
There was some discussion about moving them to a hidden network
namespace so that they are not visible from the default namespace.
I looked into this some time back, but did not find the right kernel API
to create a network namespace from within the kernel. If there is one, we
could use this mechanism to simulate a 1-netdev model.


> -Siwei
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]               ` <d9ef40a2-237b-0cce-4401-ecaeac4c602a@oracle.com>
@ 2019-02-22 15:14                 ` Michael S. Tsirkin
       [not found]                 ` <20190222100753-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22 15:14 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> 
> 
> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > 
> > 
> > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > 
> > > 
> > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > cleanly, see:
> > > > > 
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > 
> > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > not specifically written for such kernel automatic enslavement.
> > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > control (without getting netdev opened early by the other part of
> > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > which don't provides a solution if user care about consistent naming
> > > > > on the slave netdevs specifically.
> > > > > 
> > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > Above says:
> > > > 
> > > >     there's no motivation in the systemd/udevd community at
> > > >     this point to refactor the rename logic and make it work well with
> > > >     3-netdev.
> > > > 
> > > > What would the fix be? Skip slave devices?
> > > > 
> > > There's nothing user can get if just skipping slave devices - the
> > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > next reboot, while the rest may conform to the naming scheme (ens3
> > > and such). There's no way one can fix this in userspace alone - when
> > > the failover is created the enslaved netdev was opened by the kernel
> > > earlier than the userspace is made aware of, and there's no
> > > negotiation protocol for kernel to know when userspace has done
> > > initial renaming of the interface. I would expect netdev list should
> > > at least provide the direction in general for how this can be
> > > solved...


I was just wondering what you meant when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal that udev rejected?

Anyway, can we write a timing diagram of what happens, and in which order,
that leads to the failure?  That would help us look for triggers that we can
tie into, or add new ones.

> > > 
> > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > to only work with the master failover device.
> Where does this expectation come from?
> 
> Admin users may have ethtool or tc configurations that need to deal with
> predictable interface name. Third-party app which was built upon specifying
> certain interface name can't be modified to chase dynamic names.
> 
> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> offload settings post boot for specific workload. Those images won't work
> well if the name is constantly changing just after couple rounds of live
> migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.
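
As a rough sketch of what such propagation could look like, consider the
userspace model below. It is only an illustration under stated assumptions:
model_offloads, model_master_set() and model_master_attach() are hypothetical
names, and two boolean offloads stand in for whatever ethtool settings would
actually be mirrored.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_SLAVES 2

struct model_offloads {
	bool tso;
	bool gro;
};

struct model_slave {
	struct model_offloads cfg;
};

struct model_master {
	struct model_offloads cfg;	/* last configuration set on the master */
	struct model_slave *slaves[MODEL_MAX_SLAVES];
	size_t nslaves;
};

/* Setting the configuration on the master pushes it to every slave. */
static void model_master_set(struct model_master *m, struct model_offloads cfg)
{
	size_t i;

	m->cfg = cfg;
	for (i = 0; i < m->nslaves; i++)
		m->slaves[i]->cfg = cfg;
}

/* A slave that attaches later gets the stored configuration replayed. */
static bool model_master_attach(struct model_master *m, struct model_slave *s)
{
	if (m->nslaves >= MODEL_MAX_SLAVES)
		return false;
	s->cfg = m->cfg;
	m->slaves[m->nslaves++] = s;
	return true;
}
```

The replay on attach is the part that matters for live migration: a VF that is
unplugged and plugged back in, possibly under a different name, picks up the
settings again without any script having to know what the slave is called.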

> > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> model as much transparent to a real NIC as possible, while a hidden netns is
> just the vehicle). However, I recall there was resistance around this
> discussion that even the concept of hiding itself is a taboo for Linux
> netdev. I would like to summon potential alternatives before concluding
> 1-netdev is the only solution too soon.
> 
> Thanks,
> -Siwei

Your scripts would not work at all then, right?


> > 
> > > -Siwei
> > > 
> > > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                   ` <e6a53bd1-83ab-f170-406a-03276e8c87e2@oracle.com>
@ 2019-02-26  1:39                     ` Stephen Hemminger
       [not found]                     ` <20190225173912.26b93422@shemminger-XPS-13-9360>
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Stephen Hemminger @ 2019-02-26  1:39 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	liran.alon, Netdev, David Miller

On Mon, 25 Feb 2019 16:58:07 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>
> >> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>
> >>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>
> >>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>> cleanly, see:
> >>>>>>
> >>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>
> >>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>> control (without getting netdev opened early by the other part of
> >>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>> which don't provides a solution if user care about consistent naming
> >>>>>> on the slave netdevs specifically.
> >>>>>>
> >>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>> Above says:
> >>>>>
> >>>>>      there's no motivation in the systemd/udevd community at
> >>>>>      this point to refactor the rename logic and make it work well with
> >>>>>      3-netdev.
> >>>>>
> >>>>> What would the fix be? Skip slave devices?
> >>>>>  
> >>>> There's nothing user can get if just skipping slave devices - the
> >>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>> and such). There's no way one can fix this in userspace alone - when
> >>>> the failover is created the enslaved netdev was opened by the kernel
> >>>> earlier than the userspace is made aware of, and there's no
> >>>> negotiation protocol for kernel to know when userspace has done
> >>>> initial renaming of the interface. I would expect netdev list should
> >>>> at least provide the direction in general for how this can be
> >>>> solved...  
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?  
> No. I never believed this particular issue can be fixed in userspace 
> alone. Previously someone had said it could be, but I never see any work 
> or relevant discussion ever happened in various userspace communities 
> (for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
> IMHO the root of the issue derives from the kernel, it makes more sense 
> to start from netdev, work out and decide on a solution: see what can be 
> done in the kernel in order to fix it, then after that engage userspace 
> community for the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> >  
> 
> See attached diagram.
> 
> >
> >
> >
> >  
> >>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>> to only work with the master failover device.  
> >> Where does this expectation come from?
> >>
> >> Admin users may have ethtool or tc configurations that need to deal with
> >> predictable interface name. Third-party app which was built upon specifying
> >> certain interface name can't be modified to chase dynamic names.
> >>
> >> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >> offload settings post boot for specific workload. Those images won't work
> >> well if the name is constantly changing just after couple rounds of live
> >> migration.  
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.  
> I was elaborating a few examples that the expectation and assumption 
> that user/admin scripts only deal with master failover device is 
> incorrect. It had never been taken good care of, although I did try to 
> emphasize it from the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down 
> to the slave is the key pursuance of 1-netdev model. However, what I am 
> seeking now is any alternative that can also fix the specific udev 
> rename problem, before concluding that 1-netdev is the only solution. 
> Generally a 1-netdev scheme would take time to implement, while I'm 
> trying to find a way out to fix this particular naming problem under 
> 3-netdev.
> 
> >  
> >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >> model as much transparent to a real NIC as possible, while a hidden netns is
> >> just the vehicle). However, I recall there was resistance around this
> >> discussion that even the concept of hiding itself is a taboo for Linux
> >> netdev. I would like to summon potential alternatives before concluding
> >> 1-netdev is the only solution too soon.
> >>
> >> Thanks,
> >> -Siwei  
> > Your scripts would not work at all then, right?  
> At this point we don't claim images with such usage as SR-IOV live 
> migrate-able. We would flag it as live migrate-able until this ethtool 
> config issue is fully addressed and a transparent live migration 
> solution emerges in upstream eventually.

The Hyper-V netvsc driver, with its 1-dev model, uses a timeout to allow
udev to do its rename. I proposed a patch to key the state change off the
udev rename, but that patch was rejected.
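
The timeout approach can be sketched as the userspace model below. It is not
the netvsc code: model_vf, the tick-based clock and the model_* helpers are
hypothetical, and the grace period is an arbitrary parameter.

```c
#include <stdbool.h>

struct model_vf {
	unsigned long takeover_at;	/* time after which the VF is opened */
	bool up;
	bool renamed;
};

/* VF appears: do not open it yet, only record when the takeover may happen. */
static void model_vf_register(struct model_vf *vf, unsigned long now,
			      unsigned long delay)
{
	vf->takeover_at = now + delay;
	vf->up = false;
	vf->renamed = false;
}

/* Models the delayed work: open the VF once the grace period has elapsed. */
static void model_tick(struct model_vf *vf, unsigned long now)
{
	if (!vf->up && now >= vf->takeover_at)
		vf->up = true;
}

/* Models the udev rename: it only succeeds while the VF is still down. */
static bool model_try_rename(struct model_vf *vf)
{
	if (vf->up)
		return false;
	vf->renamed = true;
	return true;
}
```

The model also shows the weakness of a pure timeout: if userspace is slower
than the grace period, the rename still loses the race. Keying the state change
off the rename event itself would avoid that, at the cost of depending on udev
being present.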

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                     ` <20190225173912.26b93422@shemminger-XPS-13-9360>
@ 2019-02-26  2:05                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-26  2:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	si-wei liu, David Miller

On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
> > >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > >> model as much transparent to a real NIC as possible, while a hidden netns is
> > >> just the vehicle). However, I recall there was resistance around this
> > >> discussion that even the concept of hiding itself is a taboo for Linux
> > >> netdev. I would like to summon potential alternatives before concluding
> > >> 1-netdev is the only solution too soon.
> > >>
> > >> Thanks,
> > >> -Siwei  
> > > Your scripts would not work at all then, right?  
> > At this point we don't claim images with such usage as SR-IOV live 
> > migrate-able. We would flag it as live migrate-able until this ethtool 
> > config issue is fully addressed and a transparent live migration 
> > solution emerges in upstream eventually.
> 
> The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
> I proposed a patch to key state change off of the udev rename, but that patch was
> rejected.

Of course that would mean nothing works without udev - was
that the objection? Could you help me find that discussion pls?

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                   ` <e6a53bd1-83ab-f170-406a-03276e8c87e2@oracle.com>
  2019-02-26  1:39                     ` Stephen Hemminger
       [not found]                     ` <20190225173912.26b93422@shemminger-XPS-13-9360>
@ 2019-02-26  2:08                     ` Michael S. Tsirkin
       [not found]                     ` <20190225210529-mutt-send-email-mst@kernel.org>
  3 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-26  2:08 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> 
> 
> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > 
> > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > 
> > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > 
> > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > cleanly, see:
> > > > > > > 
> > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > 
> > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > on the slave netdevs specifically.
> > > > > > > 
> > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > Above says:
> > > > > > 
> > > > > >      there's no motivation in the systemd/udevd community at
> > > > > >      this point to refactor the rename logic and make it work well with
> > > > > >      3-netdev.
> > > > > > 
> > > > > > What would the fix be? Skip slave devices?
> > > > > > 
> > > > > There's nothing user can get if just skipping slave devices - the
> > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > earlier than the userspace is made aware of, and there's no
> > > > > negotiation protocol for kernel to know when userspace has done
> > > > > initial renaming of the interface. I would expect netdev list should
> > > > > at least provide the direction in general for how this can be
> > > > > solved...
> > 
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?
> No. I never believed this particular issue can be fixed in userspace alone.
> Previously someone had said it could be, but I never see any work or
> relevant discussion ever happened in various userspace communities (for e.g.
> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> of the issue derives from the kernel, it makes more sense to start from
> netdev, work out and decide on a solution: see what can be done in the
> kernel in order to fix it, then after that engage userspace community for
> the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> > 
> 
> See attached diagram.
> 
> > 
> > 
> > 
> > 
> > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > to only work with the master failover device.
> > > Where does this expectation come from?
> > > 
> > > Admin users may have ethtool or tc configurations that need to deal with
> > > predictable interface name. Third-party app which was built upon specifying
> > > certain interface name can't be modified to chase dynamic names.
> > > 
> > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > offload settings post boot for specific workload. Those images won't work
> > > well if the name is constantly changing just after couple rounds of live
> > > migration.
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> > 
> > BTW this is something we should look at IMHO.
> I was elaborating a few examples that the expectation and assumption that
> user/admin scripts only deal with master failover device is incorrect. It
> had never been taken good care of, although I did try to emphasize it from
> the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down to
> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> now is any alternative that can also fix the specific udev rename problem,
> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> scheme would take time to implement, while I'm trying to find a way out to
> fix this particular naming problem under 3-netdev.
> 
> > 
> > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > just the vehicle). However, I recall there was resistance around this
> > > discussion that even the concept of hiding itself is a taboo for Linux
> > > netdev. I would like to summon potential alternatives before concluding
> > > 1-netdev is the only solution too soon.
> > > 
> > > Thanks,
> > > -Siwei
> > Your scripts would not work at all then, right?
> At this point we don't claim images with such usage as SR-IOV live
> migrate-able. We would flag it as live migrate-able until this ethtool
> config issue is fully addressed and a transparent live migration solution
> emerges in upstream eventually.
> 
> 
> Thanks,
> -Siwei
> > 
> > 
> > > > > -Siwei
> > > > > 
> > > > > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > 
> 

> 
>   net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> --------------------------------------------------+------------------------------+--------------------------------------------
> (standby virtio-net and net_failover              |                              |
> devices created and initialized,                  |                              |
> i.e. virtnet_probe()->                            |                              |
>        net_failover_create()                      |                              |
> was done.)                                        |                              |
>                                                   |                              |
>                                                   |  runs `ifup ens3' ->         |
>                                                   |    ip link set dev ens3 up   |
> net_failover_open()                               |                              |
>   dev_open(virtnet_dev)                           |                              |
>     virtnet_open(virtnet_dev)                     |                              |
>   netif_carrier_on(failover_dev)                  |                              |
>   ...                                             |                              |
>                                                   |                              |
> (VF hot plugged in)                               |                              |
> ixgbevf_probe()                                   |                              |
>  register_netdev(ixgbevf_netdev)                  |                              |
>   netdev_register_kobject(ixgbevf_netdev)         |                              |
>    kobject_add(ixgbevf_dev)                       |                              |
>     device_add(ixgbevf_dev)                       |                              |
>      kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>       netlink_broadcast()                         |                              |
>   ...                                             |                              |
>   call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>    failover_event(..., NETDEV_REGISTER, ...)      |                              |
>     failover_slave_register(ixgbevf_netdev)       |                              |
>      net_failover_slave_register(ixgbevf_netdev)  |                              |
>       dev_open(ixgbevf_netdev)                    |                              |
>                                                   |                              |
>                                                   |                              |
>                                                   |                              |   received ADD uevent from netlink fd
>                                                   |                              |   ...
>                                                   |                              |   udev-builtin-net_id.c:dev_pci_slot()
>                                                   |                              |   (decided to renamed 'eth0' )
>                                                   |                              |     ip link set dev eth0 name ens4
> (dev_change_name() returns -EBUSY as              |                              |
> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>                                                   |                              |
> 

Given renaming slaves does not work anyway: would it work if we just
hard-coded slave names instead?

E.g.
1. fail slave renames
2. rename of failover to XX automatically renames standby to XXnsby
   and primary to XXnpry
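
The naming rule in step 2 can be sketched as a small helper. This is an
illustrative Python model only, not kernel code: the function name
derive_slave_names and the choice to reject (rather than truncate) names
that do not fit are assumptions. The suffixes "nsby" and "npry" come from
the proposal, and the 15-character limit reflects IFNAMSIZ (16 bytes
including the trailing NUL).

```python
IFNAMSIZ = 16  # kernel interface-name buffer size, including the trailing NUL

STANDBY_SUFFIX = "nsby"  # suffix taken from the proposal
PRIMARY_SUFFIX = "npry"  # suffix taken from the proposal


def derive_slave_names(failover_name):
    """Return (standby_name, primary_name) derived from the failover name.

    Hypothetical helper: rejecting names that would not fit is an
    assumption; the proposal does not say how over-long names are handled.
    """
    names = (failover_name + STANDBY_SUFFIX, failover_name + PRIMARY_SUFFIX)
    for name in names:
        if len(name) > IFNAMSIZ - 1:
            raise ValueError(f"{name!r} exceeds {IFNAMSIZ - 1} characters")
    return names
```

For a failover device named "ens3" this yields "ens3nsby" and "ens3npry".
Since the suffix takes four characters, the failover name itself would be
limited to 11 characters under this sketch.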


-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                       ` <d1060c75-eaba-ab6f-ff31-38cb3a47c711@oracle.com>
@ 2019-02-27 21:57                         ` Stephen Hemminger
  2019-02-27 22:38                         ` Michael S. Tsirkin
       [not found]                         ` <20190227173710-mutt-send-email-mst@kernel.org>
  2 siblings, 0 replies; 47+ messages in thread
From: Stephen Hemminger @ 2019-02-27 21:57 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	liran.alon, Netdev, David Miller

On Tue, 26 Feb 2019 16:17:21 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> >>
> >> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>>>> cleanly, see:
> >>>>>>>>
> >>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>>>
> >>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>>>> control (without getting netdev opened early by the other part of
> >>>>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>>>> which doesn't provide a solution if users care about consistent naming
> >>>>>>>> on the slave netdevs specifically.
> >>>>>>>>
> >>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>>>> Above says:
> >>>>>>>
> >>>>>>>       there's no motivation in the systemd/udevd community at
> >>>>>>>       this point to refactor the rename logic and make it work well with
> >>>>>>>       3-netdev.
> >>>>>>>
> >>>>>>> What would the fix be? Skip slave devices?
> >>>>>>>  
> >>>>>> There's nothing user can get if just skipping slave devices - the
> >>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>>>> and such). There's no way one can fix this in userspace alone - when
> >>>>>> the failover is created the enslaved netdev was opened by the kernel
> >>>>>> earlier than the userspace is made aware of, and there's no
> >>>>>> negotiation protocol for kernel to know when userspace has done
> >>>>>> initial renaming of the interface. I would expect netdev list should
> >>>>>> at least provide the direction in general for how this can be
> >>>>>> solved...  
> >>> I was just wondering what did you mean when you said
> >>> "refactor the rename logic and make it work well with 3-netdev" -
> >>> was there a proposal udev rejected?  
> >> No. I never believed this particular issue can be fixed in userspace alone.
> >> Previously someone had said it could be, but I never see any work or
> >> relevant discussion ever happened in various userspace communities (for e.g.
> >> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> >> of the issue derives from the kernel, it makes more sense to start from
> >> netdev, work out and decide on a solution: see what can be done in the
> >> kernel in order to fix it, then after that engage userspace community for
> >> the feasibility...
> >>  
> >>> Anyway, can we write a time diagram for what happens in which order that
> >>> leads to failure?  That would help look for triggers that we can tie
> >>> into, or add new ones.
> >>>  
> >> See attached diagram.
> >>  
> >>>
> >>>
> >>>  
> >>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>>>> to only work with the master failover device.  
> >>>> Where does this expectation come from?
> >>>>
> >>>> Admin users may have ethtool or tc configurations that need to deal with
> >>>> predictable interface name. Third-party app which was built upon specifying
> >>>> certain interface name can't be modified to chase dynamic names.
> >>>>
> >>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >>>> offload settings post boot for specific workload. Those images won't work
> >>>> well if the name is constantly changing just after couple rounds of live
> >>>> migration.  
> >>> It should be possible to specify the ethtool configuration on the
> >>> master and have it automatically propagated to the slave.
> >>>
> >>> BTW this is something we should look at IMHO.  
> >> I was elaborating a few examples that the expectation and assumption that
> >> user/admin scripts only deal with master failover device is incorrect. It
> >> had never been taken good care of, although I did try to emphasize it from
> >> the very beginning.
> >>
> >> Basically what you said about propagating the ethtool configuration down to
> >> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> >> now is any alternative that can also fix the specific udev rename problem,
> >> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> >> scheme would take time to implement, while I'm trying to find a way out to
> >> fix this particular naming problem under 3-netdev.
> >>  
> >>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >>>> model as much transparent to a real NIC as possible, while a hidden netns is
> >>>> just the vehicle). However, I recall there was resistance around this
> >>>> discussion that even the concept of hiding itself is a taboo for Linux
> >>>> netdev. I would like to summon potential alternatives before concluding
> >>>> 1-netdev is the only solution too soon.
> >>>>
> >>>> Thanks,
> >>>> -Siwei  
> >>> Your scripts would not work at all then, right?  
> >> At this point we don't claim images with such usage as SR-IOV live
> >> migrate-able. We would flag it as live migrate-able until this ethtool
> >> config issue is fully addressed and a transparent live migration solution
> >> emerges in upstream eventually.
> >>
> >>
> >> Thanks,
> >> -Siwei  
> >>>  
> >>>>>> -Siwei
> >>>>>>
> >>>>>>  
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >>>  
> >>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> >> --------------------------------------------------+------------------------------+--------------------------------------------
> >> (standby virtio-net and net_failover              |                              |
> >> devices created and initialized,                  |                              |
> >> i.e. virtnet_probe()->                            |                              |
> >>         net_failover_create()                      |                              |
> >> was done.)                                        |                              |
> >>                                                    |                              |
> >>                                                    |  runs `ifup ens3' ->         |
> >>                                                    |    ip link set dev ens3 up   |
> >> net_failover_open()                               |                              |
> >>    dev_open(virtnet_dev)                           |                              |
> >>      virtnet_open(virtnet_dev)                     |                              |
> >>    netif_carrier_on(failover_dev)                  |                              |
> >>    ...                                             |                              |
> >>                                                    |                              |
> >> (VF hot plugged in)                               |                              |
> >> ixgbevf_probe()                                   |                              |
> >>   register_netdev(ixgbevf_netdev)                  |                              |
> >>    netdev_register_kobject(ixgbevf_netdev)         |                              |
> >>     kobject_add(ixgbevf_dev)                       |                              |
> >>      device_add(ixgbevf_dev)                       |                              |
> >>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> >>        netlink_broadcast()                         |                              |
> >>    ...                                             |                              |
> >>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> >>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> >>      failover_slave_register(ixgbevf_netdev)       |                              |
> >>       net_failover_slave_register(ixgbevf_netdev)  |                              |
> >>        dev_open(ixgbevf_netdev)                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |   received ADD uevent from netlink fd
> >>                                                    |                              |   ...
> >>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> >>                                                    |                              |   (decided to renamed 'eth0' )
> >>                                                    |                              |     ip link set dev eth0 name ens4
> >> (dev_change_name() returns -EBUSY as              |                              |
> >> ixgbevf_netdev->flags has IFF_UP)                 |                              |
> >>                                                    |                              |
> >>  
> > Given renaming slaves does not work anyway:  
> I was actually thinking what if we relieve the rename restriction just 
> for the failover slave? What the impact would be? I think users don't 
> care about slave being renamed when it's in use, especially the initial 
> rename. Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> >
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry  
> That wouldn't help. The time when the failover master gets renamed, the 
> VF may not be present. I don't like the idea to delay exposing failover 
> master until VF is hot plugged in (probably subject to various failures) 
> later.


What netvsc does now is wait 2 seconds (to allow udev to do the rename)
before bringing the VF link up. This works, has had no problems even
with slow distributions, and is widely used.

A patch to allow ending the timeout after rename was proposed but
rejected.

https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin@microsoft.com/

Allowing network devices to change name while up is too risky. Things like
netfilter rules and other state, in and out of the kernel, may break.
Userspace does not like it when the rules change.
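
The ordering problem and the effect of deferring the open can be
illustrated with a toy model. This is a Python sketch, not kernel or netvsc
code: ToyNetdev, udev_rename, enslave_immediately and enslave_after_rename
are hypothetical names, and the only kernel behaviour modelled is the one
shown in the time diagram quoted earlier, where a rename is refused with
EBUSY once the device has IFF_UP set. The real netvsc delay is timer based;
here the ordering is simply made explicit.

```python
import errno


class ToyNetdev:
    """Minimal stand-in for a net_device: a name plus an 'up' flag."""

    def __init__(self, name):
        self.name = name
        self.up = False

    def open(self):
        self.up = True

    def change_name(self, new_name):
        # Modelled on the diagram: renaming an IFF_UP device fails with EBUSY.
        if self.up:
            raise OSError(errno.EBUSY, "device is up")
        self.name = new_name


def udev_rename(dev, new_name):
    """Model udev's rename attempt; return True on success."""
    try:
        dev.change_name(new_name)
        return True
    except OSError as err:
        if err.errno != errno.EBUSY:
            raise
        return False


def enslave_immediately(dev, new_name):
    """The slave is opened at registration, before udev gets to run."""
    dev.open()
    return udev_rename(dev, new_name)


def enslave_after_rename(dev, new_name):
    """Opening is deferred, so udev's rename happens on a down device."""
    renamed = udev_rename(dev, new_name)
    dev.open()
    return renamed
```

With a device that starts out as "eth0", the first ordering leaves it up
but still named "eth0", while the second leaves it up and named "ens4".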

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                       ` <d1060c75-eaba-ab6f-ff31-38cb3a47c711@oracle.com>
  2019-02-27 21:57                         ` Stephen Hemminger
@ 2019-02-27 22:38                         ` Michael S. Tsirkin
       [not found]                         ` <20190227173710-mutt-send-email-mst@kernel.org>
  2 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-27 22:38 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> 
> 
> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > 
> > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > cleanly, see:
> > > > > > > > > 
> > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > 
> > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > which doesn't provide a solution if users care about consistent naming
> > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > 
> > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > Above says:
> > > > > > > > 
> > > > > > > >       there's no motivation in the systemd/udevd community at
> > > > > > > >       this point to refactor the rename logic and make it work well with
> > > > > > > >       3-netdev.
> > > > > > > > 
> > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > 
> > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > at least provide the direction in general for how this can be
> > > > > > > solved...
> > > > I was just wondering what did you mean when you said
> > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > was there a proposal udev rejected?
> > > No. I never believed this particular issue can be fixed in userspace alone.
> > > Previously someone had said it could be, but I never see any work or
> > > relevant discussion ever happened in various userspace communities (for e.g.
> > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > of the issue derives from the kernel, it makes more sense to start from
> > > netdev, work out and decide on a solution: see what can be done in the
> > > kernel in order to fix it, then after that engage userspace community for
> > > the feasibility...
> > > 
> > > > Anyway, can we write a time diagram for what happens in which order that
> > > > leads to failure?  That would help look for triggers that we can tie
> > > > into, or add new ones.
> > > > 
> > > See attached diagram.
> > > 
> > > > 
> > > > 
> > > > 
> > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > to only work with the master failover device.
> > > > > Where does this expectation come from?
> > > > > 
> > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > certain interface name can't be modified to chase dynamic names.
> > > > > 
> > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > offload settings post boot for specific workload. Those images won't work
> > > > > well if the name is constantly changing just after couple rounds of live
> > > > > migration.
> > > > It should be possible to specify the ethtool configuration on the
> > > > master and have it automatically propagated to the slave.
> > > > 
> > > > BTW this is something we should look at IMHO.
> > > I was elaborating a few examples that the expectation and assumption that
> > > user/admin scripts only deal with master failover device is incorrect. It
> > > had never been taken good care of, although I did try to emphasize it from
> > > the very beginning.
> > > 
> > > Basically what you said about propagating the ethtool configuration down to
> > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > now is any alternative that can also fix the specific udev rename problem,
> > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > scheme would take time to implement, while I'm trying to find a way out to
> > > fix this particular naming problem under 3-netdev.
> > > 
> > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > just the vehicle). However, I recall there was resistance around this
> > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > 1-netdev is the only solution too soon.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > Your scripts would not work at all then, right?
> > > At this point we don't claim images with such usage as SR-IOV live
> > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > config issue is fully addressed and a transparent live migration solution
> > > emerges in upstream eventually.
> > > 
> > > 
> > > Thanks,
> > > -Siwei
> > > > 
> > > > > > > -Siwei
> > > > > > > 
> > > > > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > 
> > >    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > (standby virtio-net and net_failover              |                              |
> > > devices created and initialized,                  |                              |
> > > i.e. virtnet_probe()->                            |                              |
> > >         net_failover_create()                      |                              |
> > > was done.)                                        |                              |
> > >                                                    |                              |
> > >                                                    |  runs `ifup ens3' ->         |
> > >                                                    |    ip link set dev ens3 up   |
> > > net_failover_open()                               |                              |
> > >    dev_open(virtnet_dev)                           |                              |
> > >      virtnet_open(virtnet_dev)                     |                              |
> > >    netif_carrier_on(failover_dev)                  |                              |
> > >    ...                                             |                              |
> > >                                                    |                              |
> > > (VF hot plugged in)                               |                              |
> > > ixgbevf_probe()                                   |                              |
> > >   register_netdev(ixgbevf_netdev)                  |                              |
> > >    netdev_register_kobject(ixgbevf_netdev)         |                              |
> > >     kobject_add(ixgbevf_dev)                       |                              |
> > >      device_add(ixgbevf_dev)                       |                              |
> > >       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > >        netlink_broadcast()                         |                              |
> > >    ...                                             |                              |
> > >    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > >     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > >      failover_slave_register(ixgbevf_netdev)       |                              |
> > >       net_failover_slave_register(ixgbevf_netdev)  |                              |
> > >        dev_open(ixgbevf_netdev)                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |   received ADD uevent from netlink fd
> > >                                                    |                              |   ...
> > >                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > >                                                    |                              |   (decided to renamed 'eth0' )
> > >                                                    |                              |     ip link set dev eth0 name ens4
> > > (dev_change_name() returns -EBUSY as              |                              |
> > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > >                                                    |                              |
> > > 
> > Given renaming slaves does not work anyway:
> I was actually thinking what if we relieve the rename restriction just for
> the failover slave? What the impact would be? I think users don't care about
> slave being renamed when it's in use, especially the initial rename.
> Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> > 
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry
> That wouldn't help. The time when the failover master gets renamed, the VF
> may not be present.

In this scheme, if the VF is not there yet, it will be renamed immediately after it registers.

> I don't like the idea to delay exposing failover master
> until VF is hot plugged in (probably subject to various failures) later.
> 
> Thanks,
> -Siwei


I agree, this was not what I meant.

> > 
> > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                           ` <c72ce9eb-254c-cc3e-1969-f7f108506d5e@oracle.com>
@ 2019-02-27 23:50                             ` Michael S. Tsirkin
       [not found]                             ` <20190227184601-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-27 23:50 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > 
> > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > cleanly, see:
> > > > > > > > > > > 
> > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > 
> > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > and provides no solution if the user cares about consistent naming
> > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > 
> > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > Above says:
> > > > > > > > > > 
> > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > >        3-netdev.
> > > > > > > > > > 
> > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > 
> > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > solved...
> > > > > > I was just wondering what did you mean when you said
> > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > was there a proposal udev rejected?
> > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > Previously someone had said it could be, but I never see any work or
> > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > the feasibility...
> > > > > 
> > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > into, or add new ones.
> > > > > > 
> > > > > See attached diagram.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > to only work with the master failover device.
> > > > > > > Where does this expectation come from?
> > > > > > > 
> > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > 
> > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > migration.
> > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > master and have it automatically propagated to the slave.
> > > > > > 
> > > > > > BTW this is something we should look at IMHO.
> > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > the very beginning.
> > > > > 
> > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > fix this particular naming problem under 3-netdev.
> > > > > 
> > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > 1-netdev is the only solution too soon.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > Your scripts would not work at all then, right?
> > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > migrate-able. We would not flag it as live migrate-able until this ethtool
> > > > > config issue is fully addressed and a transparent live migration solution
> > > > > emerges in upstream eventually.
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > > > > > > -Siwei
> > > > > > > > > 
> > > > > > > > > 
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > 
> > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > (standby virtio-net and net_failover              |                              |
> > > > > devices created and initialized,                  |                              |
> > > > > i.e. virtnet_probe()->                            |                              |
> > > > >          net_failover_create()                      |                              |
> > > > > was done.)                                        |                              |
> > > > >                                                     |                              |
> > > > >                                                     |  runs `ifup ens3' ->         |
> > > > >                                                     |    ip link set dev ens3 up   |
> > > > > net_failover_open()                               |                              |
> > > > >     dev_open(virtnet_dev)                           |                              |
> > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > >     ...                                             |                              |
> > > > >                                                     |                              |
> > > > > (VF hot plugged in)                               |                              |
> > > > > ixgbevf_probe()                                   |                              |
> > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > >         netlink_broadcast()                         |                              |
> > > > >     ...                                             |                              |
> > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > >                                                     |                              |   ...
> > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > >                                                     |                              |
> > > > > 
> > > > Given renaming slaves does not work anyway:
> > > I was actually thinking what if we relieve the rename restriction just for
> > > the failover slave? What the impact would be? I think users don't care about
> > > slave being renamed when it's in use, especially the initial rename.
> > > Thoughts?
> > > 
> > > >    would it work if we just
> > > > hard-coded slave names instead?
> > > > 
> > > > E.g.
> > > > 1. fail slave renames
> > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > >      and primary to XXnpry
> > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > may not be present.
> > In this scheme if VF is not there it will be renamed immediately after registration.
> Who will be responsible to rename the slave, the kernel?

That's the idea.

> Note the master's
> name may or may not come from the userspace. If it comes from the userspace,
> should the userspace daemon change their expectation not to name/rename
> _any_ slaves (today there's no distinction)?

Yes, the idea would be to fail any attempt to rename slaves.

> How do users know which name to
> trust, depending on which wins the race more often? Say if kernel wants a
> ens3npry name while userspace wants it named as ens4.
> 
> -Siwei

With this approach the kernel will deny attempts by userspace to rename
slaves.  Slaves will always be named XXnsby and XXnpry, where XX is the
master's name. Master renames will rename both slaves.

It seems pretty solid to me; the only issue is that in theory userspace
can use a name like XXnsby for something else. But this seems unlikely.
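
To make the naming rule concrete, here is a minimal userspace sketch (not
kernel code). The helper name failover_slave_name() and the local IFNAMSIZ
fallback are assumptions made for illustration only; the nsby and npry
suffixes are the ones proposed above. It only shows how slave names could be
derived from the master's name, and when the result no longer fits in an
interface name.

```c
#include <stdio.h>
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16 /* Linux interface-name buffer size, including the NUL */
#endif

/*
 * Hypothetical helper: write "<master><suffix>" into out.
 * Returns 0 on success, -1 if the combined name would not fit in
 * IFNAMSIZ bytes (i.e. more than 15 visible characters).
 */
static int failover_slave_name(const char *master, const char *suffix,
                               char out[IFNAMSIZ])
{
    int n = snprintf(out, IFNAMSIZ, "%s%s", master, suffix);

    /* snprintf reports the untruncated length; reject anything cut off. */
    return (n < 0 || n >= IFNAMSIZ) ? -1 : 0;
}
```

With a master called ens3 this gives ens3nsby for the standby and ens3npry
for the primary. A master name of 12 or more characters leaves no room for
the 4-character suffix, a case the scheme would have to handle somehow.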


> > 
> > > I don't like the idea to delay exposing failover master
> > > until VF is hot plugged in (probably subject to various failures) later.
> > > 
> > > Thanks,
> > > -Siwei
> > 
> > I agree, this was not what I meant.
> > 
> > > > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                             ` <20190227184601-mutt-send-email-mst@kernel.org>
@ 2019-02-28  0:00                               ` Liran Alon
  2019-02-28  0:03                               ` Stephen Hemminger
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Liran Alon @ 2019-02-28  0:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev, si-wei liu,
	David Miller



> On 28 Feb 2019, at 1:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>> 
>> 
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>> 
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>> 
>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>> 
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> and provides no solution if the user cares about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>> 
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>> 
>>>>>>>>>>>       there's no motivation in the systemd/udevd community at
>>>>>>>>>>>       this point to refactor the rename logic and make it work well with
>>>>>>>>>>>       3-netdev.
>>>>>>>>>>> 
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>> 
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>> 
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>> 
>>>>>> See attached diagram.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>> 
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>> 
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>> 
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>> 
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>> 
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would not flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>>> 
>>>>>>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>> devices created and initialized,                  |                              |
>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>         net_failover_create()                      |                              |
>>>>>> was done.)                                        |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |  runs `ifup ens3' ->         |
>>>>>>                                                    |    ip link set dev ens3 up   |
>>>>>> net_failover_open()                               |                              |
>>>>>>    dev_open(virtnet_dev)                           |                              |
>>>>>>      virtnet_open(virtnet_dev)                     |                              |
>>>>>>    netif_carrier_on(failover_dev)                  |                              |
>>>>>>    ...                                             |                              |
>>>>>>                                                    |                              |
>>>>>> (VF hot plugged in)                               |                              |
>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>   register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>     kobject_add(ixgbevf_dev)                       |                              |
>>>>>>      device_add(ixgbevf_dev)                       |                              |
>>>>>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>        netlink_broadcast()                         |                              |
>>>>>>    ...                                             |                              |
>>>>>>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>      failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>       net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>        dev_open(ixgbevf_netdev)                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |   received ADD uevent from netlink fd
>>>>>>                                                    |                              |   ...
>>>>>>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                                    |                              |   (decided to rename 'eth0')
>>>>>>                                                    |                              |     ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>                                                    |                              |
>>>>>> 
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>> 
>>>>>   would it work if we just
>>>>> hard-coded slave names instead?
>>>>> 
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>     and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
> 
> That's the idea.
> 
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
> 
> Yes the idea would be to fail renaming slaves.
> 
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>> 
>> -Siwei
> 
> With this approach the kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXnsby and XXnpry, where XX is the
> master's name. Master renames will rename both slaves.
> 
> It seems pretty solid to me; the only issue is that in theory userspace
> can use a name like XXnsby for something else. But this seems unlikely.

I’m fond of this idea and share that opinion.
I think it simplifies the issue here.
I don’t see a real reason for a customer to define a udev rule that renames a net-failover slave with a different suffix.

-Liran

> 
> 
>>> 
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>> 
>>>> Thanks,
>>>> -Siwei
>>> 
>>> I agree, this was not what I meant.
>>> 
>>>>> 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                             ` <20190227184601-mutt-send-email-mst@kernel.org>
  2019-02-28  0:00                               ` Liran Alon
@ 2019-02-28  0:03                               ` Stephen Hemminger
       [not found]                               ` <20190227160342.788dc2b4@shemminger-XPS-13-9360>
       [not found]                               ` <a617ce13-4114-469d-ef33-a1c91150eeca@oracle.com>
  3 siblings, 0 replies; 47+ messages in thread
From: Stephen Hemminger @ 2019-02-28  0:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	si-wei liu, David Miller

On Wed, 27 Feb 2019 18:50:44 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > 
> > 
> > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:  
> > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:  
> > > > 
> > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:  
> > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:  
> > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > 
> > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > > which doesn't provide a solution if the user cares about consistent naming
> > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > 
> > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.  
> > > > > > > > > > > Above says:
> > > > > > > > > > > 
> > > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > > >        3-netdev.
> > > > > > > > > > > 
> > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > >   
> > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > solved...  
> > > > > > > I was just wondering what did you mean when you said
> > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > was there a proposal udev rejected?  
> > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > the feasibility...
> > > > > >   
> > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > into, or add new ones.
> > > > > > >   
> > > > > > See attached diagram.
> > > > > >   
> > > > > > > 
> > > > > > >   
> > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > to only work with the master failover device.  
> > > > > > > > Where does this expectation come from?
> > > > > > > > 
> > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > 
> > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > migration.  
> > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > master and have it automatically propagated to the slave.
> > > > > > > 
> > > > > > > BTW this is something we should look at IMHO.  
> > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > the very beginning.
> > > > > > 
> > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > fix this particular naming problem under 3-netdev.
> > > > > >   
> > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > -Siwei  
> > > > > > > Your scripts would not work at all then, right?  
> > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > emerges in upstream eventually.
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > -Siwei  
> > > > > > > > > > -Siwei
> > > > > > > > > > 
> > > > > > > > > >   
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > >   
> > > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > (standby virtio-net and net_failover              |                              |
> > > > > > devices created and initialized,                  |                              |
> > > > > > i.e. virtnet_probe()->                            |                              |
> > > > > >          net_failover_create()                      |                              |
> > > > > > was done.)                                        |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |  runs `ifup ens3' ->         |
> > > > > >                                                     |    ip link set dev ens3 up   |
> > > > > > net_failover_open()                               |                              |
> > > > > >     dev_open(virtnet_dev)                           |                              |
> > > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > > >     ...                                             |                              |
> > > > > >                                                     |                              |
> > > > > > (VF hot plugged in)                               |                              |
> > > > > > ixgbevf_probe()                                   |                              |
> > > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > >         netlink_broadcast()                         |                              |
> > > > > >     ...                                             |                              |
> > > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > > >                                                     |                              |   ...
> > > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > > >                                                     |                              |
> > > > > >   
> > > > > Given renaming slaves does not work anyway:  
> > > > I was actually thinking what if we relieve the rename restriction just for
> > > > the failover slave? What the impact would be? I think users don't care about
> > > > slave being renamed when it's in use, especially the initial rename.
> > > > Thoughts?
> > > >   
> > > > >    would it work if we just
> > > > > hard-coded slave names instead?
> > > > > 
> > > > > E.g.
> > > > > 1. fail slave renames
> > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > >      and primary to XXnpry  
> > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > may not be present.  
> > > In this scheme if VF is not there it will be renamed immediately after registration.  
> > Who will be responsible to rename the slave, the kernel?  
> 
> That's the idea.
> 
> > Note the master's
> > name may or may not come from the userspace. If it comes from the userspace,
> > should the userspace daemon change their expectation not to name/rename
> > _any_ slaves (today there's no distinction)?  
> 
> Yes the idea would be to fail renaming slaves.
> 
> > How do users know which name to
> > trust, depending on which wins the race more often? Say if kernel wants a
> > ens3npry name while userspace wants it named as ens4.
> > 
> > -Siwei  
> 
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
> 
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

Similar schemes (with kernel providing naming) were also previously rejected
upstream. It has been a consistent theme that the kernel should not be in
the renaming business. It will certainly break userspace.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                               ` <20190227160342.788dc2b4@shemminger-XPS-13-9360>
@ 2019-02-28  0:38                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  0:38 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	si-wei liu, David Miller

On Wed, Feb 27, 2019 at 04:03:42PM -0800, Stephen Hemminger wrote:
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> 
> Similar schemes (with kernel providing naming) were also previously rejected
> upstream.

Links?
I'm inclined to try and see what happens.

> It has been a consistent theme that the kernel should not be in
> the renaming business.

In this case it's not in the renaming business per se. The only reason
we even have the original name is the way the internal APIs
work. You can look at it as the slave names simply being
part of the master.

> It will certainly break userspace.

That's a strong claim. What is it based on?  It so happens that
userspace renaming slaves is already broken on virtio. So we can fix it
any way we like :)

And yes, it won't help netvsc, because netvsc wants compatibility with old
scripts, but then netvsc uses a 2-device model anyway.
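
To make the scheme concrete, here is a rough userspace sketch of how the
slave names could be derived from the master name. It is an illustration
only, not kernel code: failover_slave_name() is a made-up helper, and the
"nsby"/"npry" suffixes are the ones proposed above. One thing it shows is
that with IFNAMSIZ at 16 bytes (including the trailing NUL), a 4-character
suffix leaves at most 11 characters for the master name, so longer master
names would need some other handling.

```c
#include <stdio.h>
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16	/* same value as <linux/if.h>; includes the trailing NUL */
#endif

/*
 * Hypothetical helper (not an existing kernel function): build a slave
 * name by appending a fixed suffix to the failover master's name.
 * Returns 0 on success, -1 if the master name is empty or the result
 * would not fit in IFNAMSIZ bytes.
 */
static int failover_slave_name(const char *master, const char *suffix,
			       char out[IFNAMSIZ])
{
	size_t need = strlen(master) + strlen(suffix);

	if (master[0] == '\0' || need >= IFNAMSIZ)
		return -1;
	snprintf(out, IFNAMSIZ, "%s%s", master, suffix);
	return 0;
}
```

With this, a master called "ens3" gives "ens3nsby" for the standby and
"ens3npry" for the primary, while any 12-character master name is rejected.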

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                               ` <a617ce13-4114-469d-ef33-a1c91150eeca@oracle.com>
@ 2019-02-28  0:41                                 ` Michael S. Tsirkin
       [not found]                                 ` <20190227193923-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  0:41 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > > 
> > > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > > > which doesn't provide a solution if the user cares about consistent naming
> > > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > > > Above says:
> > > > > > > > > > > > 
> > > > > > > > > > > >         there's no motivation in the systemd/udevd community at
> > > > > > > > > > > >         this point to refactor the rename logic and make it work well with
> > > > > > > > > > > >         3-netdev.
> > > > > > > > > > > > 
> > > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > > > 
> > > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > > solved...
> > > > > > > > I was just wondering what did you mean when you said
> > > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > > was there a proposal udev rejected?
> > > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > > the feasibility...
> > > > > > > 
> > > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > > into, or add new ones.
> > > > > > > > 
> > > > > > > See attached diagram.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > > to only work with the master failover device.
> > > > > > > > > Where does this expectation come from?
> > > > > > > > > 
> > > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > > 
> > > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > > migration.
> > > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > > master and have it automatically propagated to the slave.
> > > > > > > > 
> > > > > > > > BTW this is something we should look at IMHO.
> > > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > > the very beginning.
> > > > > > > 
> > > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > > fix this particular naming problem under 3-netdev.
> > > > > > > 
> > > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > -Siwei
> > > > > > > > Your scripts would not work at all then, right?
> > > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > > emerges in upstream eventually.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > > > > > > -Siwei
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > > > 
> > > > > > >      net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > > (standby virtio-net and net_failover              |                              |
> > > > > > > devices created and initialized,                  |                              |
> > > > > > > i.e. virtnet_probe()->                            |                              |
> > > > > > >           net_failover_create()                      |                              |
> > > > > > > was done.)                                        |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |  runs `ifup ens3' ->         |
> > > > > > >                                                      |    ip link set dev ens3 up   |
> > > > > > > net_failover_open()                               |                              |
> > > > > > >      dev_open(virtnet_dev)                           |                              |
> > > > > > >        virtnet_open(virtnet_dev)                     |                              |
> > > > > > >      netif_carrier_on(failover_dev)                  |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >                                                      |                              |
> > > > > > > (VF hot plugged in)                               |                              |
> > > > > > > ixgbevf_probe()                                   |                              |
> > > > > > >     register_netdev(ixgbevf_netdev)                  |                              |
> > > > > > >      netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > > >       kobject_add(ixgbevf_dev)                       |                              |
> > > > > > >        device_add(ixgbevf_dev)                       |                              |
> > > > > > >         kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > > >          netlink_broadcast()                         |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >      call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > > >       failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > > >        failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > > >         net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > > >          dev_open(ixgbevf_netdev)                    |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |   received ADD uevent from netlink fd
> > > > > > >                                                      |                              |   ...
> > > > > > >                                                      |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > > >                                                      |                              |   (decided to renamed 'eth0' )
> > > > > > >                                                      |                              |     ip link set dev eth0 name ens4
> > > > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > > > >                                                      |                              |
> > > > > > > 
> > > > > > Given renaming slaves does not work anyway:
> > > > > I was actually thinking what if we relieve the rename restriction just for
> > > > > the failover slave? What the impact would be? I think users don't care about
> > > > > slave being renamed when it's in use, especially the initial rename.
> > > > > Thoughts?
> > > > > 
> > > > > >     would it work if we just
> > > > > > hard-coded slave names instead?
> > > > > > 
> > > > > > E.g.
> > > > > > 1. fail slave renames
> > > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > > >       and primary to XXnpry
> > > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > > may not be present.
> > > > In this scheme if VF is not there it will be renamed immediately after registration.
> > > Who will be responsible to rename the slave, the kernel?
> > That's the idea.
> > 
> > > Note the master's
> > > name may or may not come from the userspace. If it comes from the userspace,
> > > should the userspace daemon change their expectation not to name/rename
> > > _any_ slaves (today there's no distinction)?
> > Yes the idea would be to fail renaming slaves.
> No I was asking about the userspace expectation: whether it should track and
> detect the lifecycle events of failover slaves and decide what to do. How
> does it get back to the user specified name if VF is not enslaved (say
> someone unloads the virtio-net module)?

When virtio-net is removed, the VF will shortly be removed too.

> As this scheme adds much complexity to the kernel naming convention
> (currently it's just ethX names) that no userspace can understand.

Anything that pokes at slaves needs to be specially designed anyway.
Naming seems like a minor issue.

> Will the
> change break userspace further?
> 
> -Siwei

Didn't you show that userspace is already broken? You can't "further
break" it: rename already fails.
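
For reference, the failure in the diagram boils down to the ordering of two
steps. Below is a toy userspace model of it, not the kernel code:
struct model_netdev, model_slave_register() and model_change_name() are
hypothetical names, and the only behaviour modeled is the one the diagram
describes (the rename returns -EBUSY while the device has IFF_UP set).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MODEL_IFNAMSIZ 16	/* same value as IFNAMSIZ in <linux/if.h> */
#define MODEL_IFF_UP   0x1	/* same bit as IFF_UP in <linux/if.h> */

/* Toy stand-in for a net_device: just a name and the flags word. */
struct model_netdev {
	char name[MODEL_IFNAMSIZ];
	unsigned int flags;
};

/* Models the auto-enslavement step: the failover code opens the slave. */
static void model_slave_register(struct model_netdev *dev)
{
	dev->flags |= MODEL_IFF_UP;
}

/* Models the rename attempt: refused while the interface is up. */
static int model_change_name(struct model_netdev *dev, const char *newname)
{
	if (dev->flags & MODEL_IFF_UP)
		return -EBUSY;
	if (newname[0] == '\0' || strlen(newname) >= MODEL_IFNAMSIZ)
		return -EINVAL;
	snprintf(dev->name, MODEL_IFNAMSIZ, "%s", newname);
	return 0;
}
```

If model_slave_register() runs before udev gets to call the rename, as in
the diagram, the rename of 'eth0' to 'ens4' fails with -EBUSY and the name
stays 'eth0'. If the rename comes first, it succeeds.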

> > 
> > > How do users know which name to
> > > trust, depending on which wins the race more often? Say if kernel wants a
> > > ens3npry name while userspace wants it named as ens4.
> > > 
> > > -Siwei
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> > 
> > 
> > > > > I don't like the idea to delay exposing failover master
> > > > > until VF is hot plugged in (probably subject to various failures) later.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > I agree, this was not what I meant.
> > > > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                 ` <20190227193923-mutt-send-email-mst@kernel.org>
@ 2019-02-28  0:52                                   ` Jakub Kicinski
       [not found]                                   ` <20190227165205.307ed83c@cakuba.netronome.com>
       [not found]                                   ` <36901346-e3d5-4e51-6a8d-678eb5b9e352@oracle.com>
  2 siblings, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-02-28  0:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > As this scheme adds much complexity to the kernel naming convention
> > (currently it's just ethX names) that no userspace can understand.  
> 
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.

Can the users who care about the naming put net_failover into
"user space will do the bond enslavement" mode, and do the bond
creation/management themselves from user space (in systemd/ 
Network Manager) based on the failover flag?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                   ` <20190227165205.307ed83c@cakuba.netronome.com>
@ 2019-02-28  1:26                                     ` Michael S. Tsirkin
       [not found]                                     ` <20190227201857-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  1:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > > As this scheme adds much complexity to the kernel naming convention
> > > (currently it's just ethX names) that no userspace can understand.  
> > 
> > Anything that pokes at slaves needs to be specially designed anyway.
> > Naming seems like a minor issue.
> 
> Can the users who care about the naming put net_failover into
> "user space will do the bond enslavement" mode, and do the bond
> creation/management themselves from user space (in systemd/ 
> Network Manager) based on the failover flag?

Putting issues of compatibility aside (userspace tends to be confused if
you give it two devices with same MAC), how would you have it work in
practice? Timer based hacks like netvsc where if userspace didn't
respond within X seconds we assume it won't and do everything ourselves?

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                     ` <20190227201857-mutt-send-email-mst@kernel.org>
@ 2019-02-28  1:52                                       ` Jakub Kicinski
       [not found]                                       ` <20190227175218.736e13b6@cakuba.netronome.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-02-28  1:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > As this scheme adds much complexity to the kernel naming convention
> > > > (currently it's just ethX names) that no userspace can understand.    
> > > 
> > > Anything that pokes at slaves needs to be specially designed anyway.
> > > Naming seems like a minor issue.  
> > 
> > Can the users who care about the naming put net_failover into
> > "user space will do the bond enslavement" mode, and do the bond
> > creation/management themselves from user space (in systemd/ 
> > Network Manager) based on the failover flag?  
> 
> Putting issues of compatibility aside (userspace tends to be confused if
> you give it two devices with same MAC), how would you have it work in
> practice? Timer based hacks like netvsc where if userspace didn't
> respond within X seconds we assume it won't and do everything ourselves?

Well, what I'm saying is basically if user space knows how to deal with
the auto-bonding, we can put aside net_failover for the most part.  It
can either be blacklisted or it can have some knob which will
effectively disable the auto-enslavement.

Auto-bonding capable user space can do the renames, spawn the bond,
etc. all by itself.  I'm basically going back to my initial proposal
here :)  There is a RedHat bugzilla for the NetworkManager team to do
this, but we merged net_failover before those folks got around to
implementing it.

IOW if NM/systemd is capable of doing the auto-bonding itself it can
disable the kernel mechanism and take care of it all.  If kernel is
booted with an old user space which doesn't have capable NM/systemd -
net_failover will kick in and do its best.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                       ` <20190227175218.736e13b6@cakuba.netronome.com>
@ 2019-02-28  4:47                                         ` Michael S. Tsirkin
       [not found]                                         ` <20190227233812-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  4:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > > As this scheme adds much complexity to the kernel naming convention
> > > > > (currently it's just ethX names) that no userspace can understand.    
> > > > 
> > > > Anything that pokes at slaves needs to be specially designed anyway.
> > > > Naming seems like a minor issue.  
> > > 
> > > Can the users who care about the naming put net_failover into
> > > "user space will do the bond enslavement" mode, and do the bond
> > > creation/management themselves from user space (in systemd/ 
> > > Network Manager) based on the failover flag?  
> > 
> > Putting issues of compatibility aside (userspace tends to be confused if
> > you give it two devices with same MAC), how would you have it work in
> > practice? Timer based hacks like netvsc where if userspace didn't
> > respond within X seconds we assume it won't and do everything ourselves?
> 
> Well, what I'm saying is basically if user space knows how to deal with
> the auto-bonding, we can put aside net_failover for the most part.  It
> can either be blacklisted or it can have some knob which will
> effectively disable the auto-enslavement.

OK I guess we could add a module parameter to skip this.
Is this what you mean?

> Auto-bonding capable user space can do the renames, spawn the bond,
> etc. all by itself.  I'm basically going back to my initial proposal
> here :)  There is a RedHat bugzilla for the NetworkManager team to do
> this, but we merged net_failover before those folks got around to
> implementing it.

In particular because there's no policy involved whatsoever
here so it's just mechanism being pushed up to userspace.

> IOW if NM/systemd is capable of doing the auto-bonding itself it can
> disable the kernel mechanism and take care of it all.  If kernel is
> booted with an old user space which doesn't have capable NM/systemd -
> net_failover will kick in and do its best.

Sure - it's just 2 lines of code, see below.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

But I don't intend to bother until there's actual interest from
userspace developers. In particular it is not just NM/systemd, even
on Fedora - e.g. you will need to teach dracut to somehow detect
and handle this - right now it gets confused if there are two devices
with the same MAC address.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 955b3e76eb8d..dd2b2c370003 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(disable_failover, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	virtnet_init_settings(dev);
 
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
+		!disable_failover) {
 		vi->failover = net_failover_create(vi->dev);
 		if (IS_ERR(vi->failover)) {
 			err = PTR_ERR(vi->failover);
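
The decision made by the diff above can be modelled in standalone
userspace C, as a rough sketch rather than kernel code. The hunks as
posted show module_param(disable_failover, ...) but no declaration of
disable_failover, so a buildable version would presumably also need
something like a static bool next to csum, gso and napi_tx; that is an
assumption, not part of the posted patch. All sketch_* names below are
hypothetical. The feature bit number 62 matches VIRTIO_NET_F_STANDBY in
the virtio-net uapi header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VIRTIO_NET_F_STANDBY is feature bit 62 in the virtio-net uapi header. */
#define SKETCH_VIRTIO_NET_F_STANDBY 62

/* Hypothetical stand-in for the proposed module parameter (default off). */
bool sketch_disable_failover = false;

/* Userspace stand-in for virtio_has_feature(): test one feature bit. */
bool sketch_has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (features >> bit) & 1;
}

/*
 * Mirrors the condition in the patched virtnet_probe(): create the
 * failover master only if the device offers STANDBY and the knob has
 * not been set.
 */
bool sketch_should_create_failover(uint64_t features)
{
    return sketch_has_feature(features, SKETCH_VIRTIO_NET_F_STANDBY) &&
           !sketch_disable_failover;
}
```

With the knob left at its default, a device offering STANDBY still gets
the in-kernel failover master; with it set, the probe path skips
net_failover_create() and leaves the pairing of the two same-MAC
devices to user space.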

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                   ` <36901346-e3d5-4e51-6a8d-678eb5b9e352@oracle.com>
@ 2019-02-28 14:26                                     ` Michael S. Tsirkin
       [not found]                                     ` <20190228091119-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 14:26 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > Will the
> > > change break userspace further?
> > > 
> > > -Siwei
> > Didn't you show userspace is already broken? You can't "further
> > break it", rename already fails.
> It's a race: userspace tries to give the slave a user(space)-desired name,
> but sometimes fails due to this race. Today, if the failover master is not
> up, the rename succeeds anyway, whereas what you proposed prohibits the user
> from providing a name in all circumstances, if I understand you correctly.
> That's what I meant by breaking userspace further. On the other hand, you
> seem to tighten the kernel default naming to udev predictable names, which
> derive only from recent systemd-udevd, while many other userspace naming
> schemes exist. Users today who deliberately choose to disable predictable
> naming (net.ifnames=0 biosdevname=0) and fall back to kernel-provided names
> expect the ethX pattern; with this change, admin/user scripts which match
> the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.

So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break.

Maybe it is still early enough (just half a year passed) that the
number of these users would be small.  So how about a kernel config
option and maybe a module parameter to rename the primary?  People can
then opt in to the old broken behaviour.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                         ` <20190227233812-mutt-send-email-mst@kernel.org>
@ 2019-02-28 18:13                                           ` Jakub Kicinski
       [not found]                                           ` <20190228101356.39ac70aa@cakuba.netronome.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-02-28 18:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > Can the users who care about the naming put net_failover into
> > > > "user space will do the bond enslavement" mode, and do the bond
> > > > creation/management themselves from user space (in systemd/ 
> > > > Network Manager) based on the failover flag?    
> > > 
> > > Putting issues of compatibility aside (userspace tends to be confused if
> > > you give it two devices with same MAC), how would you have it work in
> > > practice? Timer based hacks like netvsc where if userspace didn't
> > > respond within X seconds we assume it won't and do everything ourselves?  
> > 
> > Well, what I'm saying is basically if user space knows how to deal with
> > the auto-bonding, we can put aside net_failover for the most part.  It
> > can either be blacklisted or it can have some knob which will
> > effectively disable the auto-enslavement.  
> 
> OK I guess we could add a module parameter to skip this.
> Is this what you mean?

Yup.

> > Auto-bonding capable user space can do the renames, spawn the bond,
> > etc. all by itself.  I'm basically going back to my initial proposal
> > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > this, but we merged net_failover before those folks got around to
> > implementing it.  
> 
> In particular because there's no policy involved whatsoever
> here so it's just mechanism being pushed up to userspace.
> 
> > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > disable the kernel mechanism and take care of it all.  If kernel is
> > booted with an old user space which doesn't have capable NM/systemd -
> > net_failover will kick in and do its best.  
> 
> Sure - it's just 2 lines of code, see below.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> But I don't intend to bother until there's actual interest from
> userspace developers to bother. In particular it is not just NM/systemd
> even on Fedora - e.g. you will need to teach dracut to somehow detect
> and handle this - right now it gets confused if there are two devices
> with same MAC addresses.

It is a bit of a chicken-or-egg situation ;)  But users can
just blacklist, too.  Anyway, I think this is far better than module
parameters for twiddling kernel-based interface naming policy.. :S

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 955b3e76eb8d..dd2b2c370003 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  module_param(napi_tx, bool, 0644);
> +module_param(disable_failover, bool, 0644);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	virtnet_init_settings(dev);
>  
> -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> +		!disable_failover) {
>  		vi->failover = net_failover_create(vi->dev);
>  		if (IS_ERR(vi->failover)) {
>  			err = PTR_ERR(vi->failover);
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                           ` <20190228101356.39ac70aa@cakuba.netronome.com>
@ 2019-02-28 19:36                                             ` Michael S. Tsirkin
       [not found]                                             ` <20190228143511-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 19:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Samudrala, Sridhar,
	virtualization, Siwei Liu, liran.alon, Netdev, si-wei liu,
	David Miller

On Thu, Feb 28, 2019 at 10:13:56AM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > > Can the users who care about the naming put net_failover into
> > > > > "user space will do the bond enslavement" mode, and do the bond
> > > > > creation/management themselves from user space (in systemd/ 
> > > > > Network Manager) based on the failover flag?    
> > > > 
> > > > Putting issues of compatibility aside (userspace tends to be confused if
> > > > you give it two devices with same MAC), how would you have it work in
> > > > practice? Timer based hacks like netvsc where if userspace didn't
> > > > respond within X seconds we assume it won't and do everything ourselves?  
> > > 
> > > Well, what I'm saying is basically if user space knows how to deal with
> > > the auto-bonding, we can put aside net_failover for the most part.  It
> > > can either be blacklisted or it can have some knob which will
> > > effectively disable the auto-enslavement.  
> > 
> > OK I guess we could add a module parameter to skip this.
> > Is this what you mean?
> 
> Yup.
> 
> > > Auto-bonding capable user space can do the renames, spawn the bond,
> > > etc. all by itself.  I'm basically going back to my initial proposal
> > > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > > this, but we merged net_failover before those folks got around to
> > > implementing it.  
> > 
> > In particular because there's no policy involved whatsoever
> > here so it's just mechanism being pushed up to userspace.
> > 
> > > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > > disable the kernel mechanism and take care of it all.  If kernel is
> > > booted with an old user space which doesn't have capable NM/systemd -
> > > net_failover will kick in and do its best.  
> > 
> > Sure - it's just 2 lines of code, see below.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > But I don't intend to bother until there's actual interest from
> > userspace developers to bother. In particular it is not just NM/systemd
> > even on Fedora - e.g. you will need to teach dracut to somehow detect
> > and handle this - right now it gets confused if there are two devices
> > with same MAC addresses.
> 
> It is a bit of a chicken-or-egg situation ;)  But users can
> just blacklist, too.  Anyway, I think this is far better than module
> parameters

Sorry I'm a bit confused. What is better than what?

> for twiddling kernel-based interface naming policy.. :S

I see your point. But my point is slave names don't really matter, only
master name matters.  So I am not sure there's any policy worth talking
about here.

> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 955b3e76eb8d..dd2b2c370003 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
> >  module_param(csum, bool, 0444);
> >  module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> > +module_param(disable_failover, bool, 0644);
> >  
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >  	virtnet_init_settings(dev);
> >  
> > -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> > +		!disable_failover) {
> >  		vi->failover = net_failover_create(vi->dev);
> >  		if (IS_ERR(vi->failover)) {
> >  			err = PTR_ERR(vi->failover);
> > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                             ` <20190228143511-mutt-send-email-mst@kernel.org>
@ 2019-02-28 19:56                                               ` Jakub Kicinski
       [not found]                                               ` <20190228115641.7afe6f09@cakuba.netronome.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-02-28 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Jiri Pirko, Samudrala, Sridhar, virtualization,
	Siwei Liu, liran.alon, Netdev, si-wei liu, David Miller

On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > It is a bit of a chicken-or-egg situation ;)  But users can
> > just blacklist, too.  Anyway, I think this is far better than module
> > parameters  
> 
> Sorry I'm a bit confused. What is better than what?

I mean that blacklisting net_failover, or a module param to disable
net_failover, and handling this in user space is better than trying to
solve the renaming at kernel level (either by adding module params that
make the kernel rename devices or by letting user space change names of
running devices if they are slaves).

> > for twiddling kernel-based interface naming policy.. :S  
> 
> I see your point. But my point is slave names don't really matter, only
> master name matters.  So I am not sure there's any policy worth talking
> about here.

Oh yes, I don't disagree with you, but others seem to want to rename
the auto-bonded lower devices.  Which can be done trivially if it was 
a daemon in user space instantiating the auto-bond.  We are just
providing a basic version of auto-bonding in the kernel.  If there are
extra requirements on policy, or naming - the whole thing is better
solved in user space.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                               ` <20190228115641.7afe6f09@cakuba.netronome.com>
@ 2019-02-28 20:14                                                 ` Michael S. Tsirkin
       [not found]                                                 ` <20190228151349-mutt-send-email-mst@kernel.org>
                                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 20:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Jiri Pirko, Samudrala, Sridhar, virtualization,
	Siwei Liu, liran.alon, Netdev, si-wei liu, David Miller

On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters  
> > 
> > Sorry I'm a bit confused. What is better than what?
> 
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).
> 
> > > for twiddling kernel-based interface naming policy.. :S  
> > 
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
> 
> Oh yes, I don't disagree with you, but others seem to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was 
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.

OK so it seems that you would be happy with a combination of the module
parameter disabling failover completely and renaming primary in kernel?
Did I get it right?

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                                 ` <20190228151349-mutt-send-email-mst@kernel.org>
@ 2019-02-28 23:31                                                   ` Jakub Kicinski
  0 siblings, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-02-28 23:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Jiri Pirko, Samudrala, Sridhar, virtualization,
	Siwei Liu, liran.alon, Netdev, si-wei liu, David Miller

On Thu, 28 Feb 2019 15:14:55 -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters    
> > > 
> > > Sorry I'm a bit confused. What is better than what?  
> > 
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).
> >   
> > > > for twiddling kernel-based interface naming policy.. :S    
> > > 
> > > I see your point. But my point is slave names don't really matter, only
> > > master name matters.  So I am not sure there's any policy worth talking
> > > about here.  
> > 
> > Oh yes, I don't disagree with you, but others seem to want to rename
> > the auto-bonded lower devices.  Which can be done trivially if it was 
> > a daemon in user space instantiating the auto-bond.  We are just
> > providing a basic version of auto-bonding in the kernel.  If there are
> > extra requirements on policy, or naming - the whole thing is better
> > solved in user space.  
> 
> OK so it seems that you would be happy with a combination of the module
> parameter disabling failover completely and renaming primary in kernel?
> Did I get it right?

Not 100%, I'm personally not convinced that renaming primary in the
kernel is okay.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                               ` <20190228115641.7afe6f09@cakuba.netronome.com>
  2019-02-28 20:14                                                 ` Michael S. Tsirkin
       [not found]                                                 ` <20190228151349-mutt-send-email-mst@kernel.org>
@ 2019-03-01  0:20                                                 ` Siwei Liu
       [not found]                                                 ` <CADGSJ239j-fG_vifkj-gTnAb_RXtBEbnQCFdZGimcgPXh35CMA@mail.gmail.com>
  3 siblings, 0 replies; 47+ messages in thread
From: Siwei Liu @ 2019-03-01  0:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin,
	Samudrala, Sridhar, virtualization, liran.alon, Netdev,
	si-wei liu, David Miller

On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters
> >
> > Sorry I'm a bit confused. What is better than what?
>
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).

Before I was asked to revive this old mail thread, I knew the
discussion could end up somewhere like this. Yes, theoretically
there's a point - basically you don't believe the kernel should take
the risk of fixing the issue, so you pin the hope on something
hypothetical that was never done and is hard to get done in reality.
It's not too different from saying "hey, what you're asking for is
simply wrong, don't do it! Go back and modify userspace to create a
bond or team instead!" FWIW I want to emphasize that the debate over
the right place to implement this failover facility, userspace
versus kernel, has been around for almost a decade, and no real work
ever happened in userspace to "standardize" this in the Linux world.
The truth is that it's quite a lot of complex work to get it
implemented right in userspace: what Michael mentions about making
dracut auto-bonding aware is just the tip of the iceberg. Basically
one would need to modify all the existing network config tools to
work well with this new auto-bonding concept: handle duplicate MACs,
differentiate it from a regular bond/team, fix boot-time dependencies
for network boot, etc. Moreover, from a cloud provider's perspective
it's not a single distro's effort, at least not as simple as saying
"just move it to a daemon in systemd/NM and the work is done". We
(Oracle) did extensive work over the past year to help align various
userspace components, and worked with distro vendors to patch shipped
packages so that they work with the failover 3-netdev model. The work
needed for userspace auto-bonding would be more involved than that,
in return for quite trivial value (just naming?), so I doubt any
userspace developer could be motivated to do it.

So, simply put, no, we have zero interest in this direction. If
upstream believes this is the final conclusion, I think we can stop
discussing.

Thanks,
-Siwei
>
> > > for twiddling kernel-based interface naming policy.. :S
> >
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
>
> Oh yes, I don't disagree with you, but others seem to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                                 ` <CADGSJ239j-fG_vifkj-gTnAb_RXtBEbnQCFdZGimcgPXh35CMA@mail.gmail.com>
@ 2019-03-01  1:05                                                   ` Jakub Kicinski
       [not found]                                                   ` <20190228170520.527ed6df@cakuba.netronome.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2019-03-01  1:05 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin,
	Samudrala, Sridhar, virtualization, liran.alon, Netdev,
	si-wei liu, David Miller

On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters  
> > >
> > > Sorry I'm a bit confused. What is better than what?  
> >
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).  
> 
> Before I was asked to revive this old mail thread, I knew the
> discussion could end up with something like this. Yes, theoretically
> there's a point - basically you don't believe the kernel should take the
> risk of fixing the issue, so you pin the hope on something hypothetical
> that was never done and is hard to get done in reality.
> It's not too different than saying "hey, what you're asking for is
> simply wrong, don't do it! Go back to modify userspace to create a
> bond or team instead!" FWIW I want to emphasize that the debate for
> what should be the right place to implement this failover facility:
> userspace versus kernel, had been around for almost a decade, and no
> real work ever happened in userspace to "standardize" this in the
> Linux world.

Let me offer you my very subjective opinion of why "no real work ever
happened in user space".  The actors who have primary interest to get
the auto-bonding working are HW vendors trying to either convince
customers to use SR-IOV, or being pressured by customers to make SR-IOV
easier to consume.  HW vendors hire driver developers, not user space
developers.  So the solution we arrive at is in the kernel for a
non-technical reason (Conway's law, sort of).

$ cd NetworkManager/
$ git log --pretty=format:"%ae" | \
    grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
     81 andrew.zaborowski@intel.com
      2 David.Woodhouse@intel.com
      2 ismo.puustinen@intel.com
      1 michael.i.doherty@intel.com

Andrew works on WiFi.

I asked the NetworkManager folks to implement this feature last
year when net_failover got dangerously close to getting merged, and
they said they had never been approached with this request before, much
less offered code that solves it.  Unfortunately, net_failover was merged
before they got around to it, and they didn't proceed.

So to my knowledge nobody ever tried to solve this in user space.
I don't think net_failover is particularly terrible, or that renaming
of primary in the kernel is the end of the world, but I'd appreciate if
you could point me to efforts to solve it upstream in user space
components, or acknowledge that nobody actually tried that.

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                       ` <8a387954-1e21-947b-a5a9-c49adaea2e81@oracle.com>
@ 2019-03-01 13:27                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2019-03-01 13:27 UTC (permalink / raw)
  To: si-wei liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, liran.alon, Netdev,
	David Miller

On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > > > Will the
> > > > > change break userspace further?
> > > > > 
> > > > > -Siwei
> > > > Didn't you show userspace is already broken. You can't "further
> > > > break it", rename already fails.
> > > It's a race, userspace tends to give slave a user(space) desired name but
> > > sometimes may fail due to this race. Today if failover master is not up,
> > > rename would succeed anyway. While what you proposed prohibits user from
> > > providing a name in all circumstances if I understand you correctly. That's
> > > what I meant of breaking userspace further. On the other hand, you seem to
> > > tighten the kernel default naming to udev predictable names, which are
> > > derived only from recent systemd-udevd, while there exist many other
> > > possible userspace naming schemes. Users today who deliberately choose
> > > to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> > > kernel-provided names would expect the ethX pattern; with this change,
> > > admin/user scripts which match the ethX pattern could potentially break.
> > Whatever crashes with a name not matching ethX will crash on the
> > standby interface *anyway*.
> With udev predictable naming disabled they should not. It's not hard for
> a user to look up a device attribute to persist the name in a
> consistent and reliable way.

Well, that's special code for failover already. So far we have just
taught userspace to skip renaming slave interfaces.
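
To make the "skip renaming slave interfaces" idea concrete, here is a
minimal sketch of the check a userspace tool might make. It assumes the
tool reads the hex flag word from /sys/class/net/<ifname>/flags and that
a failover slave carries IFF_SLAVE there; that second point is an
assumption for illustration, not something stated in this thread. The
helper names are hypothetical. The IFF_UP check mirrors the race
discussed above, where rename is refused once the device has been
brought up.

```python
# Flag values as defined in linux/if.h.
IFF_UP = 0x1       # interface is up
IFF_SLAVE = 0x800  # interface is enslaved to a master


def parse_flags(text):
    """Parse the hex string found in /sys/class/net/<ifname>/flags."""
    return int(text.strip(), 16)


def should_skip_rename(flags_text):
    """Hypothetical policy: leave slave interfaces alone."""
    return bool(parse_flags(flags_text) & IFF_SLAVE)


def rename_would_fail(flags_text):
    """Rename is refused while the device is up, which is the race
    userspace hits once the master has opened the slave."""
    return bool(parse_flags(flags_text) & IFF_UP)
```

A tool that persists names by device attribute would still need such a
check first, since the outcome depends on whether the master came up
before the rename was attempted.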

> > 
> > So I think what you are saying is that someone might have already
> > written scripts and gotten them to work on v4.17 when STANDBY was
> > included and these scripts rely on ethX. Now these scripts
> > will break.
> The controversial part is the new kernel naming pattern. Initially I thought
> there shouldn't be such crazy scripts relying on the pattern, but when I
> worked on cloud-init I realized that there's already a lot of software
> making assumptions about the 'eth0' name. In the past I've seen random
> scripts that parse the ethX name and assume (incorrectly) that the name ends
> with digits, or even that the digits and names are 1:1 mapped. Of course, you
> can say these are bugs in the scripts themselves.

No, what I am saying is that they will crash on rename of standby too.

> Anyway, I'll let others on netdev comment on this new scheme; maybe
> I am the only one concerned. The good part of your proposal is that
> we can get a consistent slave name, which still plays its role until we move
> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
> think we both agree that the master matters more than the slave names.
> > 
> > Maybe it is still early enough (just half a year passed) that the
> > number of these users would be small.  So how about a kernel config
> > option and maybe a module parameter to rename the primary?  People can
> > then opt in to the old broken behaviour.
> If I could, I would ask why a similar opt-in (kernel config or module
> parameter) couldn't be implemented to lift the rename restriction on
> slaves, net_failover in particular. My feeling is that this rename restriction
> exists more for historical reasons than anything else, while net_failover
> is a comparatively new type of link for which we are still designing the use
> cases it should support, and it can be shaped into whatever fits. My
> personal view is that "the slave can't be renamed while the master is running"
> is just an implementation detail that got incorrectly exposed to userspace
> apps for many years. It's old behavior with historical reasons for sure, but I
> don't think this applies to net_failover.
> 
> (FWIW as a former bond maintainer for another OS, we lifted the rename
> restriction on slaves 13 years ago, and not a single complaint or issue was
> ever raised because of this change over the years, neither from the customers
> of an installed base of tens of millions, nor from the FOSS software running
> atop. Of course, Linux is different so that experience doesn't count.)
> 
> Thanks,
> -Siwei
> 

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]                                                   ` <20190228170520.527ed6df@cakuba.netronome.com>
@ 2019-03-02  0:30                                                     ` Siwei Liu
  0 siblings, 0 replies; 47+ messages in thread
From: Siwei Liu @ 2019-03-02  0:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin,
	Samudrala, Sridhar, virtualization, liran.alon, Netdev,
	si-wei liu, David Miller

On Thu, Feb 28, 2019 at 5:05 PM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> > On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > > parameters
> > > >
> > > > Sorry I'm a bit confused. What is better than what?
> > >
> > > I mean that blacklist net_failover or module param to disable
> > > net_failover and handle in user space are better than trying to solve
> > > the renaming at kernel level (either by adding module params that make
> > > the kernel rename devices or letting user space change names of running
> > > devices if they are slaves).
> >
> > Before I was asked to revive this old mail thread, I knew the
> > discussion could end up with something like this. Yes, theoretically
> > there's a point - basically you don't believe the kernel should take the
> > risk of fixing the issue, so you pin the hope on something hypothetical
> > that was never done and is hard to get done in reality.
> > It's not too different than saying "hey, what you're asking for is
> > simply wrong, don't do it! Go back to modify userspace to create a
> > bond or team instead!" FWIW I want to emphasize that the debate for
> > what should be the right place to implement this failover facility:
> > userspace versus kernel, had been around for almost a decade, and no
> > real work ever happened in userspace to "standardize" this in the
> > Linux world.
>
> Let me offer you my very subjective opinion of why "no real work ever
> happened in user space".  The actors who have primary interest to get
> the auto-bonding working are HW vendors trying to either convince
> customers to use SR-IOV, or being pressured by customers to make SR-IOV
> easier to consume.  HW vendors hire driver developers, not user space
> developers.  So the solution we arrive at is in the kernel for a non
> technical reason (Conway's law, sort of).
>
> $ cd NetworkManager/
> $ git log --pretty=format:"%ae" | \
>     grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
>      81 andrew.zaborowski@intel.com
>       2 David.Woodhouse@intel.com
>       2 ismo.puustinen@intel.com
>       1 michael.i.doherty@intel.com
>
> Andrew works on WiFi.
>

I'm sorry, but we don't use NetworkManager in our cloud images at all.
We suffered from lots of problems when booting from a remote iSCSI disk
with NetworkManager enabled, and it looks like those issues are still
there. That is not something (my subjective impression) a network config
tool mainly targeting desktop and WiFi users ever cares about. At the
least, it is a sign of insufficient testing there.

From a cloud service provider's perspective, we always prefer a single
central solution to talking to various distro vendors, each with their
own network daemons/config tools and thus different solutions. It's hard
to coordinate all efforts in one place. From my personal perspective, the
in-kernel auto-slave solution is in no way technically inferior to any
userspace implementation, and every major OS/cloud provider chose to
implement this in-kernel model for the same reason. I don't want to
argue any further about whether there's value in net_failover being in
the Linux kernel; given that it's already there, I think it's better to
move on.

We have done extensive work reporting issues (actually, fixing them
internally before posting) to the dracut, udev, initramfs-tools, and
cloud-init communities. Although, as claimed, the 3-netdev model should
be transparent to userspace in general, the reality is the opposite: the
effort is no different from bringing up a new type of virtual bond,
rather than what any existing userspace tool would expect for a regular
physical netdev. If there is any concern about breaking userspace, I bet
it comes from people who never tried using it. If they had, they would
know what I am saying. The duplicate MAC address setting and the
plugging order are so new to userspace that no userspace tool knows how
to plumb the failover interface properly without being fixed in one way
or another.
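
To illustrate the duplicate MAC point, here is a minimal sketch of how a
tool could spot interfaces that share a MAC address before deciding how
to plumb them. The function name and the sample interface names are
hypothetical; the input is assumed to be a mapping of interface name to
MAC string, gathered however the tool prefers.

```python
from collections import defaultdict


def group_by_mac(macs):
    """Return {mac: [ifnames]} for every MAC used by more than one
    interface. MACs are compared case-insensitively."""
    groups = defaultdict(list)
    for name, mac in macs.items():
        groups[mac.lower()].append(name)
    # Keep only the MACs that are shared, with names in a stable order.
    return {mac: sorted(names)
            for mac, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    sample = {
        "eth0": "52:54:00:AA:BB:CC",  # e.g. the failover master
        "eth1": "52:54:00:aa:bb:cc",  # e.g. the standby slave
        "ens4": "52:54:00:aa:bb:cc",  # e.g. the primary (VF) slave
        "lo": "00:00:00:00:00:00",
    }
    print(group_by_mac(sample))
```

A tool that assumes one MAC per interface would pick an arbitrary member
of such a group, which is the kind of breakage described above.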

-Siwei

> I asked the NetworkManager folks to implement this feature last
> year when net_failover got dangerously close to getting merged, and
> they said they had never been approached with this request before, much
> less offered code that solves it.  Unfortunately, net_failover was merged
> before they got around to it, and they didn't proceed.
>
> So to my knowledge nobody ever tried to solve this in user space.
> I don't think net_failover is particularly terrible, or that renaming
> of primary in the kernel is the end of the world, but I'd appreciate if
> you could point me to efforts to solve it upstream in user space
> components, or acknowledge that nobody actually tried that.
