* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Michael S. Tsirkin @ 2018-04-18 16:20 UTC (permalink / raw)
To: Anshuman Khandual
Cc: robh, Benjamin Herrenschmidt, linux-kernel, virtualization,
Christoph Hellwig, joe, linuxppc-dev, elfring, david
In-Reply-To: <002f153f-452d-f64b-4fc7-7f3383b39162@linux.vnet.ibm.com>
On Wed, Apr 18, 2018 at 08:47:10AM +0530, Anshuman Khandual wrote:
> On 04/15/2018 05:41 PM, Christoph Hellwig wrote:
> > On Fri, Apr 06, 2018 at 06:37:18PM +1000, Benjamin Herrenschmidt wrote:
> >>>> implemented as DMA API which the virtio core understands. There is no
> >>>> need for an IOMMU to be involved for the device representation in this
> >>>> case IMHO.
> >>>
> >>> This whole virtio translation issue is a mess. I think we need to
> >>> switch it to the dma API, and then quirk the legacy case to always
> >>> use the direct mapping inside the dma API.
> >>
> >> Fine with using a dma API always on the Linux side, but we do want to
> >> special case virtio still at the arch and qemu side to have a "direct
> >> mapping" mode. Not sure how (special flags on PCI devices) to avoid
> >> actually going through an emulated IOMMU on the qemu side, because that
> >> slows things down, esp. with vhost.
> >>
> >> IE, we can't I think just treat it the same as a physical device.
> >
> > We should have treated it like a physical device from the start, but
> > that device has unfortunately sailed.
> >
> > But yes, we'll need a per-device quirk that says 'don't attach an
> > iommu'.
>
> How about doing it per platform basis as suggested in this RFC through
> an arch specific callback. Because all the virtio devices in the given
> platform would require and exercise this option (to avail bounce buffer
> mechanism for secure guests as an example). So the flag basically is a
> platform specific one not a device specific one.
That's not the case. A single platform can have a mix of virtio and
non-virtio devices. Same applies even within virtio, e.g. the balloon
device always bypasses an iommu. Further, QEMU supports out of process
devices some of which might bypass the IOMMU.
--
MST
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 16:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization,
Eric Dumazet, Joby Poriyath, Ben Hutchings, David S. Miller
In-Reply-To: <3e65977e-53cd-bf09-bc4b-0ce40e9091fe@gmail.com>
On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>
> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> > The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> > fails (later patches change it to kvzalloc).
> >
> > The problem with this is that if the vzalloc function is actually used,
> > virtio_net doesn't work (because it expects that the extra memory should
> > be accessible with DMA-API and memory allocated with vzalloc isn't).
> >
> > This patch changes it back to kzalloc and adds a warning if the allocated
> > size is too large (the allocation is unreliable in this case).
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >
> > ---
> > net/core/dev.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/net/core/dev.c
> > ===================================================================
> > --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> > @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> > /* ensure 32-byte alignment of whole construct */
> > alloc_size += NETDEV_ALIGN - 1;
> >
> > - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> > + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > if (!p)
> > return NULL;
> >
> >
>
> Since when a net_device needs to be in DMA zone ???
>
> I would rather fix virtio_net, this looks very suspect to me.
>
> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
The structure net_device is followed by arbitrary driver-specific data
(accessible with the function netdev_priv). And for virtio-net, these
driver-specific data must be in DMA memory.
Mikulas
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Eric Dumazet @ 2018-04-18 16:51 UTC (permalink / raw)
To: Mikulas Patocka, Eric Dumazet
Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization,
Eric Dumazet, Joby Poriyath, Ben Hutchings, David S. Miller
In-Reply-To: <alpine.LRH.2.02.1804181218270.19136@file01.intranet.prod.int.rdu2.redhat.com>
On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>>
>>
>> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
>>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
>>> fails (later patches change it to kvzalloc).
>>>
>>> The problem with this is that if the vzalloc function is actually used,
>>> virtio_net doesn't work (because it expects that the extra memory should
>>> be accessible with DMA-API and memory allocated with vzalloc isn't).
>>>
>>> This patch changes it back to kzalloc and adds a warning if the allocated
>>> size is too large (the allocation is unreliable in this case).
>>>
>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
>>>
>>> ---
>>> net/core/dev.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6/net/core/dev.c
>>> ===================================================================
>>> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
>>> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
>>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
>>> /* ensure 32-byte alignment of whole construct */
>>> alloc_size += NETDEV_ALIGN - 1;
>>>
>>> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
>>> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> if (!p)
>>> return NULL;
>>>
>>>
>>
>> Since when a net_device needs to be in DMA zone ???
>>
>> I would rather fix virtio_net, this looks very suspect to me.
>>
>> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
>> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
>
> The structure net_device is followed by arbitrary driver-specific data
> (accessible with the function netdev_priv). And for virtio-net, these
> driver-specific data must be in DMA memory.
I get that, but how is the original xenvif problem will be solved ?
Your patch would add a bug in some other driver(s)
I suggest that virtio_net clearly identifies which part needs a specific allocation
and does its itself, instead of abusing the netdev_priv storage.
Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Eric Dumazet @ 2018-04-18 16:51 UTC (permalink / raw)
To: Mikulas Patocka, Eric Dumazet
Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization,
Eric Dumazet, Joby Poriyath, Ben Hutchings, David S. Miller
In-Reply-To: <alpine.LRH.2.02.1804181218270.19136@file01.intranet.prod.int.rdu2.redhat.com>
On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>>
>>
>> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
>>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
>>> fails (later patches change it to kvzalloc).
>>>
>>> The problem with this is that if the vzalloc function is actually used,
>>> virtio_net doesn't work (because it expects that the extra memory should
>>> be accessible with DMA-API and memory allocated with vzalloc isn't).
>>>
>>> This patch changes it back to kzalloc and adds a warning if the allocated
>>> size is too large (the allocation is unreliable in this case).
>>>
>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
>>>
>>> ---
>>> net/core/dev.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6/net/core/dev.c
>>> ===================================================================
>>> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
>>> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
>>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
>>> /* ensure 32-byte alignment of whole construct */
>>> alloc_size += NETDEV_ALIGN - 1;
>>>
>>> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
>>> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> if (!p)
>>> return NULL;
>>>
>>>
>>
>> Since when a net_device needs to be in DMA zone ???
>>
>> I would rather fix virtio_net, this looks very suspect to me.
>>
>> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
>> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
>
> The structure net_device is followed by arbitrary driver-specific data
> (accessible with the function netdev_priv). And for virtio-net, these
> driver-specific data must be in DMA memory.
I get that, but how is the original xenvif problem will be solved ?
Your patch would add a bug in some other driver(s)
I suggest that virtio_net clearly identifies which part needs a specific allocation
and does its itself, instead of abusing the netdev_priv storage.
Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: David Miller @ 2018-04-18 17:46 UTC (permalink / raw)
To: mpatocka
Cc: eric.dumazet, mst, netdev, linux-kernel, virtualization, edumazet,
joby.poriyath, bhutchings
In-Reply-To: <alpine.LRH.2.02.1804181218270.19136@file01.intranet.prod.int.rdu2.redhat.com>
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> The structure net_device is followed by arbitrary driver-specific data
> (accessible with the function netdev_priv). And for virtio-net, these
> driver-specific data must be in DMA memory.
And we are saying that this assumption is wrong and needs to be
corrected.
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: David Miller @ 2018-04-18 17:47 UTC (permalink / raw)
To: eric.dumazet
Cc: mst, netdev, linux-kernel, virtualization, edumazet, mpatocka,
joby.poriyath, bhutchings
In-Reply-To: <5f4e1286-b79f-0b9f-9a30-47d7654f3889@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Apr 2018 09:51:25 -0700
> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
>
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
+1
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization,
Eric Dumazet, Ben Hutchings, David S. Miller
In-Reply-To: <5f4e1286-b79f-0b9f-9a30-47d7654f3889@gmail.com>
On Wed, 18 Apr 2018, Eric Dumazet wrote:
>
>
> On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
> >
> >
> > On Wed, 18 Apr 2018, Eric Dumazet wrote:
> >
> >>
> >>
> >> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> >>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> >>> fails (later patches change it to kvzalloc).
> >>>
> >>> The problem with this is that if the vzalloc function is actually used,
> >>> virtio_net doesn't work (because it expects that the extra memory should
> >>> be accessible with DMA-API and memory allocated with vzalloc isn't).
> >>>
> >>> This patch changes it back to kzalloc and adds a warning if the allocated
> >>> size is too large (the allocation is unreliable in this case).
> >>>
> >>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >>>
> >>> ---
> >>> net/core/dev.c | 3 ++-
> >>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> Index: linux-2.6/net/core/dev.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.000000000 +0200
> >>> +++ linux-2.6/net/core/dev.c 2018-04-18 16:24:43.000000000 +0200
> >>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> >>> /* ensure 32-byte alignment of whole construct */
> >>> alloc_size += NETDEV_ALIGN - 1;
> >>>
> >>> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> >>> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>> if (!p)
> >>> return NULL;
> >>>
> >>>
> >>
> >> Since when a net_device needs to be in DMA zone ???
> >>
> >> I would rather fix virtio_net, this looks very suspect to me.
> >>
> >> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> >> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
> >
> > The structure net_device is followed by arbitrary driver-specific data
> > (accessible with the function netdev_priv). And for virtio-net, these
> > driver-specific data must be in DMA memory.
>
> I get that, but how is the original xenvif problem will be solved ?
>
> Your patch would add a bug in some other driver(s)
>
> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
>
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
There are drivers that need to do DMA to driver-specific area.
And there are drivers that need driver-specific area larger than kmalloc
limit.
These are conflicting requirements and one of those drivers must be
changed.
I suggest to change the drivers that need large driver-specific area.
That's why I added the WARN_ON, so that they can be identified.
Mikulas
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:53 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, mst, netdev, linux-kernel, virtualization, edumazet,
bhutchings
In-Reply-To: <20180418.134651.2225112489265654270.davem@davemloft.net>
On Wed, 18 Apr 2018, David Miller wrote:
> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
>
> > The structure net_device is followed by arbitrary driver-specific data
> > (accessible with the function netdev_priv). And for virtio-net, these
> > driver-specific data must be in DMA memory.
>
> And we are saying that this assumption is wrong and needs to be
> corrected.
So, try to find all the networking drivers that to DMA to the private
area.
The problem here is that kvzalloc usually returns DMA-able area, but it
may return non-DMA area rarely, if the memory is too fragmented. So, we
are in a situation, where some networking drivers will randomly fail. Go
and find them.
Mikulas
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:55 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, mst, netdev, linux-kernel, virtualization, edumazet,
joby.poriyath, bhutchings
In-Reply-To: <20180418.134721.283381724211219025.davem@davemloft.net>
On Wed, 18 Apr 2018, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
>
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> >
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
>
> +1
And what about the other networking drivers that may do the same thing?
Mikulas
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Michael S. Tsirkin @ 2018-04-18 18:00 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, netdev, linux-kernel, virtualization, edumazet,
mpatocka, joby.poriyath, bhutchings
In-Reply-To: <20180418.134721.283381724211219025.davem@davemloft.net>
On Wed, Apr 18, 2018 at 01:47:21PM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
>
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> >
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
>
> +1
I can do this, but just FYI it's all of 16 bytes which is only mapped
for DMA temporarily - and not all of it - a byte here, a byte there.
The amount of hoops one has to jump through just to get 1 byte from
device nowdays is annoying.
--
MST
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-18 18:43 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <20180418092515.GB1989@nanopsycho>
On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> This provides a generic interface for paravirtual drivers to listen
>>>> for netdev register/unregister/link change events from pci ethernet
>>>> devices with the same MAC and takeover their datapath. The notifier and
>>>> event handling code is based on the existing netvsc implementation.
>>>>
>>>> It exposes 2 sets of interfaces to the paravirtual drivers.
>>>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>>>> master netdev is created. The paravirtual driver registers each bypass
>>>> instance along with a set of ops to manage the slave events.
>>>> bypass_master_register()
>>>> bypass_master_unregister()
>>>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>>>> the bypass module provides interfaces to create/destroy additional master
>>>> netdev and all the slave events are managed internally.
>>>> bypass_master_create()
>>>> bypass_master_destroy()
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>> include/linux/netdevice.h | 14 +
>>>> include/net/bypass.h | 96 ++++++
>>>> net/Kconfig | 18 +
>>>> net/core/Makefile | 1 +
>>>> net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 973 insertions(+)
>>>> create mode 100644 include/net/bypass.h
>>>> create mode 100644 net/core/bypass.c
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index cf44503ea81a..587293728f70 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>>>> IFF_PHONY_HEADROOM = 1<<24,
>>>> IFF_MACSEC = 1<<25,
>>>> IFF_NO_RX_HANDLER = 1<<26,
>>>> + IFF_BYPASS = 1 << 27,
>>>> + IFF_BYPASS_SLAVE = 1 << 28,
>>> I wonder, why you don't follow the existing coding style... Also, please
>>> add these to into the comment above.
>> To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> to the existing coding style to be consistent.
> Please do.
>
>
>>>
>>>> };
>>>>
>>>> #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>>>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>>>> #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>>>> #define IFF_MACSEC IFF_MACSEC
>>>> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>>>> +#define IFF_BYPASS IFF_BYPASS
>>>> +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>>>>
>>>> /**
>>>> * struct net_device - The DEVICE structure.
>>>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>>> return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>>> }
>>>>
>>>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>>>> +{
>>>> + return dev->priv_flags & IFF_BYPASS;
>>>> +}
>>>> +
>>>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>>>> +{
>>>> + return dev->priv_flags & IFF_BYPASS_SLAVE;
>>>> +}
>>>> +
>>>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>>> static inline void netif_keep_dst(struct net_device *dev)
>>>> {
>>>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>>>> new file mode 100644
>>>> index 000000000000..86b02cb894cf
>>>> --- /dev/null
>>>> +++ b/include/net/bypass.h
>>>> @@ -0,0 +1,96 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2018, Intel Corporation. */
>>>> +
>>>> +#ifndef _NET_BYPASS_H
>>>> +#define _NET_BYPASS_H
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +
>>>> +struct bypass_ops {
>>>> + int (*slave_pre_register)(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev);
>>>> + int (*slave_join)(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev);
>>>> + int (*slave_pre_unregister)(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev);
>>>> + int (*slave_release)(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev);
>>>> + int (*slave_link_change)(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev);
>>>> + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>>>> +};
>>>> +
>>>> +struct bypass_master {
>>>> + struct list_head list;
>>>> + struct net_device __rcu *bypass_netdev;
>>>> + struct bypass_ops __rcu *ops;
>>>> +};
>>>> +
>>>> +/* bypass state */
>>>> +struct bypass_info {
>>>> + /* passthru netdev with same MAC */
>>>> + struct net_device __rcu *active_netdev;
>>> You still use "active"/"backup" names which is highly misleading as
>>> it has completely different meaning that in bond for example.
>>> I noted that in my previous review already. Please change it.
>> I guess the issue is with only the 'active' name. 'backup' should be fine as it also
>> matches with the BACKUP feature bit we are adding to virtio_net.
> I think that "backup" is also misleading. Both "active" and "backup"
> mean a *state* of slaves. This should be named differently.
>
>
>
>> With regards to alternate names for 'active', you suggested 'stolen', but i
>> am not too happy with it.
>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> No. The netdev could be any netdevice. It does not have to be a "VF".
> I think "stolen" is quite appropriate since it describes the modus
> operandi. The bypass master steals some netdevice according to some
> match.
>
> But I don't insist on "stolen". Just sounds right.
We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
'backup' name is consistent.
The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
Will look for any suggestions in the next day or two. If i don't get any, i will go
with 'stolen'
<snip>
> +
> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> + struct bypass_ops **ops)
> +{
> + struct bypass_master *bypass_master;
> + struct net_device *bypass_netdev;
> +
> + spin_lock(&bypass_lock);
> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>> As I wrote the last time, you don't need this list, spinlock.
>>> You can do just something like:
>>> for_each_net(net) {
>>> for_each_netdev(net, dev) {
>>> if (netif_is_bypass_master(dev)) {
>> This function returns the upper netdev as well as the ops associated
>> with that netdev.
>> bypass_master_list is a list of 'struct bypass_master' that associates
> Well, can't you have it in netdev priv?
We cannot do this for 2-netdev model as there is no bypass_netdev created.
>
>
>> 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> NULL for 3-netdev model.
> I see :(
>
>
>>
>>>
>>>
>>>
>>>> + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>>>> + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>>>> + *ops = rcu_dereference(bypass_master->ops);
>>> I don't see how rcu_dereference is ok here.
>>> 1) I don't see rcu_read_lock taken
>>> 2) Looks like bypass_master->ops has the same value across the whole
>>> existence.
>> We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> Yes. ops doesn't change.
> If it does not change, you can just access it directly.
>
>
>>>
>>>> + spin_unlock(&bypass_lock);
>>>> + return bypass_netdev;
>>>> + }
>>>> + }
>>>> + spin_unlock(&bypass_lock);
>>>> + return NULL;
>>>> +}
>>>> +
>>>> +static int bypass_slave_register(struct net_device *slave_netdev)
>>>> +{
>>>> + struct net_device *bypass_netdev;
>>>> + struct bypass_ops *bypass_ops;
>>>> + int ret, orig_mtu;
>>>> +
>>>> + ASSERT_RTNL();
>>>> +
>>>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> + &bypass_ops);
>>> For master, could you use word "master" in the variables so it is clear?
>>> Also, "dev" is fine instead of "netdev".
>>> Something like "bpmaster_dev"
>> bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
> I was trying to point out, that "bypass_netdev" represents a "master"
> netdev, yet it does not say master. That is why I suggested
> "bpmaster_dev"
>
>
>> I can change all _netdev suffixes to _dev to make the names shorter.
> ok.
>
>
>>
>>>
>>>> + if (!bypass_netdev)
>>>> + goto done;
>>>> +
>>>> + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>>>> + bypass_ops);
>>>> + if (ret != 0)
>>> Just "if (ret)" will do. You have this on more places.
>> OK.
>>
>>
>>>
>>>> + goto done;
>>>> +
>>>> + ret = netdev_rx_handler_register(slave_netdev,
>>>> + bypass_ops ? bypass_ops->handle_frame :
>>>> + bypass_handle_frame, bypass_netdev);
>>>> + if (ret != 0) {
>>>> + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>>>> + ret);
>>>> + goto done;
>>>> + }
>>>> +
>>>> + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>>>> + if (ret != 0) {
>>>> + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>>>> + bypass_netdev->name, ret);
>>>> + goto upper_link_failed;
>>>> + }
>>>> +
>>>> + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>>>> +
>>>> + if (netif_running(bypass_netdev)) {
>>>> + ret = dev_open(slave_netdev);
>>>> + if (ret && (ret != -EBUSY)) {
>>>> + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>>>> + slave_netdev->name, ret);
>>>> + goto err_interface_up;
>>>> + }
>>>> + }
>>>> +
>>>> + /* Align MTU of slave with master */
>>>> + orig_mtu = slave_netdev->mtu;
>>>> + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>>>> + if (ret != 0) {
>>>> + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>>>> + slave_netdev->name, bypass_netdev->mtu);
>>>> + goto err_set_mtu;
>>>> + }
>>>> +
>>>> + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>>>> + if (ret != 0)
>>>> + goto err_join;
>>>> +
>>>> + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>>>> +
>>>> + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>>>> + slave_netdev->name);
>>>> +
>>>> + goto done;
>>>> +
>>>> +err_join:
>>>> + dev_set_mtu(slave_netdev, orig_mtu);
>>>> +err_set_mtu:
>>>> + dev_close(slave_netdev);
>>>> +err_interface_up:
>>>> + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +upper_link_failed:
>>>> + netdev_rx_handler_unregister(slave_netdev);
>>>> +done:
>>>> + return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev,
>>>> + struct bypass_ops *bypass_ops)
>>>> +{
>>>> + struct net_device *backup_netdev, *active_netdev;
>>>> + struct bypass_info *bi;
>>>> +
>>>> + if (bypass_ops) {
>>>> + if (!bypass_ops->slave_pre_unregister)
>>>> + return -EINVAL;
>>>> +
>>>> + return bypass_ops->slave_pre_unregister(slave_netdev,
>>>> + bypass_netdev);
>>>> + }
>>>> +
>>>> + bi = netdev_priv(bypass_netdev);
>>>> + active_netdev = rtnl_dereference(bi->active_netdev);
>>>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> + return -EINVAL;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int bypass_slave_release(struct net_device *slave_netdev,
>>>> + struct net_device *bypass_netdev,
>>>> + struct bypass_ops *bypass_ops)
>>>> +{
>>>> + struct net_device *backup_netdev, *active_netdev;
>>>> + struct bypass_info *bi;
>>>> +
>>>> + if (bypass_ops) {
>>>> + if (!bypass_ops->slave_release)
>>>> + return -EINVAL;
>>> I think it would be good to make the API to the driver more strict and
>>> have a separate set of ops for "active" and "backup" netdevices.
>>> That should stop people thinking about extending this to more slaves in
>>> the future.
>> We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> 'active' slave.
> I'm very well aware of that. I just thought that explicit ops for the
> two slaves would make this more clear.
>
>
>>
>>>
>>>
>>>> +
>>>> + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>>>> + }
>>>> +
>>>> + bi = netdev_priv(bypass_netdev);
>>>> + active_netdev = rtnl_dereference(bi->active_netdev);
>>>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> + if (slave_netdev == backup_netdev) {
>>>> + RCU_INIT_POINTER(bi->backup_netdev, NULL);
>>>> + } else {
>>>> + RCU_INIT_POINTER(bi->active_netdev, NULL);
>>>> + if (backup_netdev) {
>>>> + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>>>> + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>>>> + }
>>>> + }
>>>> +
>>>> + dev_put(slave_netdev);
>>>> +
>>>> + netdev_info(bypass_netdev, "bypass slave:%s released\n",
>>>> + slave_netdev->name);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>>>> +{
>>>> + struct net_device *bypass_netdev;
>>>> + struct bypass_ops *bypass_ops;
>>>> + int ret;
>>>> +
>>>> + if (!netif_is_bypass_slave(slave_netdev))
>>>> + goto done;
>>>> +
>>>> + ASSERT_RTNL();
>>>> +
>>>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> + &bypass_ops);
>>>> + if (!bypass_netdev)
>>>> + goto done;
>>>> +
>>>> + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>>>> + bypass_ops);
>>>> + if (ret != 0)
>>>> + goto done;
>>>> +
>>>> + netdev_rx_handler_unregister(slave_netdev);
>>>> + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +
>>>> + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>>>> +
>>>> + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>>>> + slave_netdev->name);
>>>> +
>>>> +done:
>>>> + return NOTIFY_DONE;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>>>> +
>>>> +static bool bypass_xmit_ready(struct net_device *dev)
>>>> +{
>>>> + return netif_running(dev) && netif_carrier_ok(dev);
>>>> +}
>>>> +
>>>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>>>> +{
>>>> + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>>>> + struct bypass_ops *bypass_ops;
>>>> + struct bypass_info *bi;
>>>> +
>>>> + if (!netif_is_bypass_slave(slave_netdev))
>>>> + goto done;
>>>> +
>>>> + ASSERT_RTNL();
>>>> +
>>>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> + &bypass_ops);
>>>> + if (!bypass_netdev)
>>>> + goto done;
>>>> +
>>>> + if (bypass_ops) {
>>>> + if (!bypass_ops->slave_link_change)
>>>> + goto done;
>>>> +
>>>> + return bypass_ops->slave_link_change(slave_netdev,
>>>> + bypass_netdev);
>>>> + }
>>>> +
>>>> + if (!netif_running(bypass_netdev))
>>>> + return 0;
>>>> +
>>>> + bi = netdev_priv(bypass_netdev);
>>>> +
>>>> + active_netdev = rtnl_dereference(bi->active_netdev);
>>>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> + goto done;
>>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>>> above is enough.
>> I think we need this check to not allow events from a slave that is not
>> attached to this master but has the same MAC.
> Why do we need such events? Seems wrong to me.
We want to avoid events from a netdev that is mis-configured with the same MAC as
a bypass setup.
> Consider:
>
> bp1 bp2
> a1 b1 a2 b2
>
>
> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
We should not have 2 bypass configs with the same MAC.
I need to add a check in the bypass_master_register() to prevent this.
The above check is to avoid cases where we have
bp1(a1, b1) with mac1
and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> the order of creation.
> Let's say it will return bp1. Then when we have event for a2, the
> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>
>
> You cannot use bypass_master_get_bymac() here.
>
>
>
>>>
>>>> +
>>>> + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>>>> + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>>>> + netif_carrier_on(bypass_netdev);
>>>> + netif_tx_wake_all_queues(bypass_netdev);
>>>> + } else {
>>>> + netif_carrier_off(bypass_netdev);
>>>> + netif_tx_stop_all_queues(bypass_netdev);
>>>> + }
>>>> +
>>>> +done:
>>>> + return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static bool bypass_validate_event_dev(struct net_device *dev)
>>>> +{
>>>> + /* Skip parent events */
>>>> + if (netif_is_bypass_master(dev))
>>>> + return false;
>>>> +
>>>> + /* Avoid non-Ethernet type devices */
>>>> + if (dev->type != ARPHRD_ETHER)
>>>> + return false;
>>>> +
>>>> + /* Avoid Vlan dev with same MAC registering as VF */
>>>> + if (is_vlan_dev(dev))
>>>> + return false;
>>>> +
>>>> + /* Avoid Bonding master dev with same MAC registering as slave dev */
>>>> + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>> helpers netif_is_bond_master().
>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>
>>> You need to do it not by blacklisting, but with whitelisting. You need
>>> to whitelist VF devices. My port flavours patchset might help with this.
>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
> I don't see such function in the code.
It is netdev_has_any_lower_dev(). I need to export it.
>
>
>> device is not an upper dev.
>> Can you point to your port flavours patchset? Is it upstream?
> I sent rfc couple of weeks ago:
> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 19:13 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <dd0d53f0-f5da-cb7d-f8f6-d0c8245eb3cf@intel.com>
Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > >
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > > bypass_master_register()
>> > > > bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > > > the bypass module provides interfaces to create/destroy additional master
>> > > > netdev and all the slave events are managed internally.
>> > > > bypass_master_create()
>> > > > bypass_master_destroy()
>> > > >
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h | 14 +
>> > > > include/net/bypass.h | 96 ++++++
>> > > > net/Kconfig | 18 +
>> > > > net/core/Makefile | 1 +
>> > > > net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > >
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > > IFF_PHONY_HEADROOM = 1<<24,
>> > > > IFF_MACSEC = 1<<25,
>> > > > IFF_NO_RX_HANDLER = 1<<26,
>> > > > + IFF_BYPASS = 1 << 27,
>> > > > + IFF_BYPASS_SLAVE = 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> > to the existing coding style to be consistent.
>> Please do.
>>
>>
>> > >
>> > > > };
>> > > >
>> > > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASS IFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>> > > >
>> > > > /**
>> > > > * struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > > > return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > >
>> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > > > +{
>> > > > + return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > + return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > > > +}
>> > > > +
>> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > > > static inline void netif_keep_dst(struct net_device *dev)
>> > > > {
>> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > > > new file mode 100644
>> > > > index 000000000000..86b02cb894cf
>> > > > --- /dev/null
>> > > > +++ b/include/net/bypass.h
>> > > > @@ -0,0 +1,96 @@
>> > > > +// SPDX-License-Identifier: GPL-2.0
>> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> > > > +
>> > > > +#ifndef _NET_BYPASS_H
>> > > > +#define _NET_BYPASS_H
>> > > > +
>> > > > +#include <linux/netdevice.h>
>> > > > +
>> > > > +struct bypass_ops {
>> > > > + int (*slave_pre_register)(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev);
>> > > > + int (*slave_join)(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev);
>> > > > + int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev);
>> > > > + int (*slave_release)(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev);
>> > > > + int (*slave_link_change)(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev);
>> > > > + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > > > +};
>> > > > +
>> > > > +struct bypass_master {
>> > > > + struct list_head list;
>> > > > + struct net_device __rcu *bypass_netdev;
>> > > > + struct bypass_ops __rcu *ops;
>> > > > +};
>> > > > +
>> > > > +/* bypass state */
>> > > > +struct bypass_info {
>> > > > + /* passthru netdev with same MAC */
>> > > > + struct net_device __rcu *active_netdev;
>> > > You still use "active"/"backup" names which is highly misleading as
>> > > it has completely different meaning that in bond for example.
>> > > I noted that in my previous review already. Please change it.
>> > I guess the issue is with only the 'active' name. 'backup' should be fine as it also
>> > matches with the BACKUP feature bit we are adding to virtio_net.
>> I think that "backup" is also misleading. Both "active" and "backup"
>> mean a *state* of slaves. This should be named differently.
>>
>>
>>
>> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > am not too happy with it.
>> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> No. The netdev could be any netdevice. It does not have to be a "VF".
>> I think "stolen" is quite appropriate since it describes the modus
>> operandi. The bypass master steals some netdevice according to some
>> match.
>>
>> But I don't insist on "stolen". Just sounds right.
>
>We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>'backup' name is consistent.
It perhaps makes sense from the view of virtio device. However, as I
described couple of times, for master/slave device the name "backup" is
highly misleading.
>
>The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>
>Will look for any suggestions in the next day or two. If i don't get any, i will go
>with 'stolen'
>
><snip>
>
>
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> + struct bypass_ops **ops)
>> +{
>> + struct bypass_master *bypass_master;
>> + struct net_device *bypass_netdev;
>> +
>> + spin_lock(&bypass_lock);
>> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > As I wrote the last time, you don't need this list, spinlock.
>> > > You can do just something like:
>> > > for_each_net(net) {
>> > > for_each_netdev(net, dev) {
>> > > if (netif_is_bypass_master(dev)) {
>> > This function returns the upper netdev as well as the ops associated
>> > with that netdev.
>> > bypass_master_list is a list of 'struct bypass_master' that associates
>> Well, can't you have it in netdev priv?
>
>We cannot do this for 2-netdev model as there is no bypass_netdev created.
Howcome? You have no master? I don't understand..
>
>>
>>
>> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> > NULL for 3-netdev model.
>> I see :(
>>
>>
>> >
>> > >
>> > >
>> > >
>> > > > + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > > > + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > > > + *ops = rcu_dereference(bypass_master->ops);
>> > > I don't see how rcu_dereference is ok here.
>> > > 1) I don't see rcu_read_lock taken
>> > > 2) Looks like bypass_master->ops has the same value across the whole
>> > > existence.
>> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> > Yes. ops doesn't change.
>> If it does not change, you can just access it directly.
>>
>>
>> > >
>> > > > + spin_unlock(&bypass_lock);
>> > > > + return bypass_netdev;
>> > > > + }
>> > > > + }
>> > > > + spin_unlock(&bypass_lock);
>> > > > + return NULL;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > > > +{
>> > > > + struct net_device *bypass_netdev;
>> > > > + struct bypass_ops *bypass_ops;
>> > > > + int ret, orig_mtu;
>> > > > +
>> > > > + ASSERT_RTNL();
>> > > > +
>> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > + &bypass_ops);
>> > > For master, could you use word "master" in the variables so it is clear?
>> > > Also, "dev" is fine instead of "netdev".
>> > > Something like "bpmaster_dev"
>> > bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
>> I was trying to point out, that "bypass_netdev" represents a "master"
>> netdev, yet it does not say master. That is why I suggested
>> "bpmaster_dev"
>>
>>
>> > I can change all _netdev suffixes to _dev to make the names shorter.
>> ok.
>>
>>
>> >
>> > >
>> > > > + if (!bypass_netdev)
>> > > > + goto done;
>> > > > +
>> > > > + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > > > + bypass_ops);
>> > > > + if (ret != 0)
>> > > Just "if (ret)" will do. You have this on more places.
>> > OK.
>> >
>> >
>> > >
>> > > > + goto done;
>> > > > +
>> > > > + ret = netdev_rx_handler_register(slave_netdev,
>> > > > + bypass_ops ? bypass_ops->handle_frame :
>> > > > + bypass_handle_frame, bypass_netdev);
>> > > > + if (ret != 0) {
>> > > > + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > > > + ret);
>> > > > + goto done;
>> > > > + }
>> > > > +
>> > > > + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > > > + if (ret != 0) {
>> > > > + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > > > + bypass_netdev->name, ret);
>> > > > + goto upper_link_failed;
>> > > > + }
>> > > > +
>> > > > + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > > > +
>> > > > + if (netif_running(bypass_netdev)) {
>> > > > + ret = dev_open(slave_netdev);
>> > > > + if (ret && (ret != -EBUSY)) {
>> > > > + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > > > + slave_netdev->name, ret);
>> > > > + goto err_interface_up;
>> > > > + }
>> > > > + }
>> > > > +
>> > > > + /* Align MTU of slave with master */
>> > > > + orig_mtu = slave_netdev->mtu;
>> > > > + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > > > + if (ret != 0) {
>> > > > + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > > > + slave_netdev->name, bypass_netdev->mtu);
>> > > > + goto err_set_mtu;
>> > > > + }
>> > > > +
>> > > > + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > > > + if (ret != 0)
>> > > > + goto err_join;
>> > > > +
>> > > > + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > > > +
>> > > > + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > > > + slave_netdev->name);
>> > > > +
>> > > > + goto done;
>> > > > +
>> > > > +err_join:
>> > > > + dev_set_mtu(slave_netdev, orig_mtu);
>> > > > +err_set_mtu:
>> > > > + dev_close(slave_netdev);
>> > > > +err_interface_up:
>> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +upper_link_failed:
>> > > > + netdev_rx_handler_unregister(slave_netdev);
>> > > > +done:
>> > > > + return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev,
>> > > > + struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > + struct net_device *backup_netdev, *active_netdev;
>> > > > + struct bypass_info *bi;
>> > > > +
>> > > > + if (bypass_ops) {
>> > > > + if (!bypass_ops->slave_pre_unregister)
>> > > > + return -EINVAL;
>> > > > +
>> > > > + return bypass_ops->slave_pre_unregister(slave_netdev,
>> > > > + bypass_netdev);
>> > > > + }
>> > > > +
>> > > > + bi = netdev_priv(bypass_netdev);
>> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > + return -EINVAL;
>> > > > +
>> > > > + return 0;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > > > + struct net_device *bypass_netdev,
>> > > > + struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > + struct net_device *backup_netdev, *active_netdev;
>> > > > + struct bypass_info *bi;
>> > > > +
>> > > > + if (bypass_ops) {
>> > > > + if (!bypass_ops->slave_release)
>> > > > + return -EINVAL;
>> > > I think it would be good to make the API to the driver more strict and
>> > > have a separate set of ops for "active" and "backup" netdevices.
>> > > That should stop people thinking about extending this to more slaves in
>> > > the future.
>> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> > 'active' slave.
>> I'm very well aware of that. I just thought that explicit ops for the
>> two slaves would make this more clear.
>>
>>
>> >
>> > >
>> > >
>> > > > +
>> > > > + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > > > + }
>> > > > +
>> > > > + bi = netdev_priv(bypass_netdev);
>> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > + if (slave_netdev == backup_netdev) {
>> > > > + RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > > > + } else {
>> > > > + RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > > > + if (backup_netdev) {
>> > > > + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > > > + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > > > + }
>> > > > + }
>> > > > +
>> > > > + dev_put(slave_netdev);
>> > > > +
>> > > > + netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > > > + slave_netdev->name);
>> > > > +
>> > > > + return 0;
>> > > > +}
>> > > > +
>> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > > > +{
>> > > > + struct net_device *bypass_netdev;
>> > > > + struct bypass_ops *bypass_ops;
>> > > > + int ret;
>> > > > +
>> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> > > > + goto done;
>> > > > +
>> > > > + ASSERT_RTNL();
>> > > > +
>> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > + &bypass_ops);
>> > > > + if (!bypass_netdev)
>> > > > + goto done;
>> > > > +
>> > > > + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > > > + bypass_ops);
>> > > > + if (ret != 0)
>> > > > + goto done;
>> > > > +
>> > > > + netdev_rx_handler_unregister(slave_netdev);
>> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +
>> > > > + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +
>> > > > + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > > > + slave_netdev->name);
>> > > > +
>> > > > +done:
>> > > > + return NOTIFY_DONE;
>> > > > +}
>> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > > > +
>> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> > > > +{
>> > > > + return netif_running(dev) && netif_carrier_ok(dev);
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > > > +{
>> > > > + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > > > + struct bypass_ops *bypass_ops;
>> > > > + struct bypass_info *bi;
>> > > > +
>> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> > > > + goto done;
>> > > > +
>> > > > + ASSERT_RTNL();
>> > > > +
>> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > + &bypass_ops);
>> > > > + if (!bypass_netdev)
>> > > > + goto done;
>> > > > +
>> > > > + if (bypass_ops) {
>> > > > + if (!bypass_ops->slave_link_change)
>> > > > + goto done;
>> > > > +
>> > > > + return bypass_ops->slave_link_change(slave_netdev,
>> > > > + bypass_netdev);
>> > > > + }
>> > > > +
>> > > > + if (!netif_running(bypass_netdev))
>> > > > + return 0;
>> > > > +
>> > > > + bi = netdev_priv(bypass_netdev);
>> > > > +
>> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > + goto done;
>> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> > > above is enough.
>> > I think we need this check to not allow events from a slave that is not
>> > attached to this master but has the same MAC.
>> Why do we need such events? Seems wrong to me.
>
>We want to avoid events from a netdev that is mis-configured with the same MAC as
>a bypass setup.
>
>> Consider:
>>
>> bp1 bp2
>> a1 b1 a2 b2
>>
>>
>> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>
>We should not have 2 bypass configs with the same MAC.
>I need to add a check in the bypass_master_register() to prevent this.
Mac can change, you would have to check in change as well. Feels odd
thought.
>
>The above check is to avoid cases where we have
>bp1(a1, b1) with mac1
>and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>
>> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> the order of creation.
>> Let's say it will return bp1. Then when we have event for a2, the
>> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>>
>>
>> You cannot use bypass_master_get_bymac() here.
>>
>>
>>
>> > >
>> > > > +
>> > > > + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > > > + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > > > + netif_carrier_on(bypass_netdev);
>> > > > + netif_tx_wake_all_queues(bypass_netdev);
>> > > > + } else {
>> > > > + netif_carrier_off(bypass_netdev);
>> > > > + netif_tx_stop_all_queues(bypass_netdev);
>> > > > + }
>> > > > +
>> > > > +done:
>> > > > + return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > > > +{
>> > > > + /* Skip parent events */
>> > > > + if (netif_is_bypass_master(dev))
>> > > > + return false;
>> > > > +
>> > > > + /* Avoid non-Ethernet type devices */
>> > > > + if (dev->type != ARPHRD_ETHER)
>> > > > + return false;
>> > > > +
>> > > > + /* Avoid Vlan dev with same MAC registering as VF */
>> > > > + if (is_vlan_dev(dev))
>> > > > + return false;
>> > > > +
>> > > > + /* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > helpers netif_is_bond_master().
>> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > >
>> > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > to whitelist VF devices. My port flavours patchset might help with this.
>> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> I don't see such function in the code.
>
>It is netdev_has_any_lower_dev(). I need to export it.
Come on, you cannot use that. That would allow bonding without slaves,
but the slaves could be added later on.
What exactly you are trying to achieve by this?
>
>>
>>
>> > device is not an upper dev.
>> > Can you point to your port flavours patchset? Is it upstream?
>> I sent rfc couple of weeks ago:
>> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>
>
>
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Michael S. Tsirkin @ 2018-04-18 19:46 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180418191315.GA1922@nanopsycho>
On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > This provides a generic interface for paravirtual drivers to listen
> >> > > > for netdev register/unregister/link change events from pci ethernet
> >> > > > devices with the same MAC and takeover their datapath. The notifier and
> >> > > > event handling code is based on the existing netvsc implementation.
> >> > > >
> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
> >> > > > master netdev is created. The paravirtual driver registers each bypass
> >> > > > instance along with a set of ops to manage the slave events.
> >> > > > bypass_master_register()
> >> > > > bypass_master_unregister()
> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
> >> > > > the bypass module provides interfaces to create/destroy additional master
> >> > > > netdev and all the slave events are managed internally.
> >> > > > bypass_master_create()
> >> > > > bypass_master_destroy()
> >> > > >
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > include/linux/netdevice.h | 14 +
> >> > > > include/net/bypass.h | 96 ++++++
> >> > > > net/Kconfig | 18 +
> >> > > > net/core/Makefile | 1 +
> >> > > > net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > 5 files changed, 973 insertions(+)
> >> > > > create mode 100644 include/net/bypass.h
> >> > > > create mode 100644 net/core/bypass.c
> >> > > >
> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> > > > index cf44503ea81a..587293728f70 100644
> >> > > > --- a/include/linux/netdevice.h
> >> > > > +++ b/include/linux/netdevice.h
> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> >> > > > IFF_PHONY_HEADROOM = 1<<24,
> >> > > > IFF_MACSEC = 1<<25,
> >> > > > IFF_NO_RX_HANDLER = 1<<26,
> >> > > > + IFF_BYPASS = 1 << 27,
> >> > > > + IFF_BYPASS_SLAVE = 1 << 28,
> >> > > I wonder, why you don't follow the existing coding style... Also, please
> >> > > add these to into the comment above.
> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
> >> > to the existing coding style to be consistent.
> >> Please do.
> >>
> >>
> >> > >
> >> > > > };
> >> > > >
> >> > > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> >> > > > #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
> >> > > > #define IFF_MACSEC IFF_MACSEC
> >> > > > #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
> >> > > > +#define IFF_BYPASS IFF_BYPASS
> >> > > > +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
> >> > > >
> >> > > > /**
> >> > > > * struct net_device - The DEVICE structure.
> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> >> > > > return dev->priv_flags & IFF_RXFH_CONFIGURED;
> >> > > > }
> >> > > >
> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
> >> > > > +{
> >> > > > + return dev->priv_flags & IFF_BYPASS;
> >> > > > +}
> >> > > > +
> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
> >> > > > +{
> >> > > > + return dev->priv_flags & IFF_BYPASS_SLAVE;
> >> > > > +}
> >> > > > +
> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> >> > > > static inline void netif_keep_dst(struct net_device *dev)
> >> > > > {
> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
> >> > > > new file mode 100644
> >> > > > index 000000000000..86b02cb894cf
> >> > > > --- /dev/null
> >> > > > +++ b/include/net/bypass.h
> >> > > > @@ -0,0 +1,96 @@
> >> > > > +// SPDX-License-Identifier: GPL-2.0
> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
> >> > > > +
> >> > > > +#ifndef _NET_BYPASS_H
> >> > > > +#define _NET_BYPASS_H
> >> > > > +
> >> > > > +#include <linux/netdevice.h>
> >> > > > +
> >> > > > +struct bypass_ops {
> >> > > > + int (*slave_pre_register)(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev);
> >> > > > + int (*slave_join)(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev);
> >> > > > + int (*slave_pre_unregister)(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev);
> >> > > > + int (*slave_release)(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev);
> >> > > > + int (*slave_link_change)(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev);
> >> > > > + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> >> > > > +};
> >> > > > +
> >> > > > +struct bypass_master {
> >> > > > + struct list_head list;
> >> > > > + struct net_device __rcu *bypass_netdev;
> >> > > > + struct bypass_ops __rcu *ops;
> >> > > > +};
> >> > > > +
> >> > > > +/* bypass state */
> >> > > > +struct bypass_info {
> >> > > > + /* passthru netdev with same MAC */
> >> > > > + struct net_device __rcu *active_netdev;
> >> > > You still use "active"/"backup" names which is highly misleading as
> >> > > it has completely different meaning that in bond for example.
> >> > > I noted that in my previous review already. Please change it.
> >> > I guess the issue is with only the 'active' name. 'backup' should be fine as it also
> >> > matches with the BACKUP feature bit we are adding to virtio_net.
> >> I think that "backup" is also misleading. Both "active" and "backup"
> >> mean a *state* of slaves. This should be named differently.
> >>
> >>
> >>
> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> > am not too happy with it.
> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> I think "stolen" is quite appropriate since it describes the modus
> >> operandi. The bypass master steals some netdevice according to some
> >> match.
> >>
> >> But I don't insist on "stolen". Just sounds right.
> >
> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >'backup' name is consistent.
>
> It perhaps makes sense from the view of virtio device. However, as I
> described couple of times, for master/slave device the name "backup" is
> highly misleading.
virtio is the backup. You are supposed to use another
(typically passthrough) device, if that fails use virtio.
It does seem appropriate to me. If you like, we can
change that to "standby". Active I don't like either. "main"?
In fact would failover be better than bypass?
>
> >
> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
> >
> >Will look for any suggestions in the next day or two. If i don't get any, i will go
> >with 'stolen'
> >
> ><snip>
> >
> >
> >> +
> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> >> + struct bypass_ops **ops)
> >> +{
> >> + struct bypass_master *bypass_master;
> >> + struct net_device *bypass_netdev;
> >> +
> >> + spin_lock(&bypass_lock);
> >> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
> >> > > As I wrote the last time, you don't need this list, spinlock.
> >> > > You can do just something like:
> >> > > for_each_net(net) {
> >> > > for_each_netdev(net, dev) {
> >> > > if (netif_is_bypass_master(dev)) {
> >> > This function returns the upper netdev as well as the ops associated
> >> > with that netdev.
> >> > bypass_master_list is a list of 'struct bypass_master' that associates
> >> Well, can't you have it in netdev priv?
> >
> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>
> Howcome? You have no master? I don't understand..
>
>
>
> >
> >>
> >>
> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
> >> > NULL for 3-netdev model.
> >> I see :(
> >>
> >>
> >> >
> >> > >
> >> > >
> >> > >
> >> > > > + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
> >> > > > + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
> >> > > > + *ops = rcu_dereference(bypass_master->ops);
> >> > > I don't see how rcu_dereference is ok here.
> >> > > 1) I don't see rcu_read_lock taken
> >> > > 2) Looks like bypass_master->ops has the same value across the whole
> >> > > existence.
> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
> >> > Yes. ops doesn't change.
> >> If it does not change, you can just access it directly.
> >>
> >>
> >> > >
> >> > > > + spin_unlock(&bypass_lock);
> >> > > > + return bypass_netdev;
> >> > > > + }
> >> > > > + }
> >> > > > + spin_unlock(&bypass_lock);
> >> > > > + return NULL;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > + struct net_device *bypass_netdev;
> >> > > > + struct bypass_ops *bypass_ops;
> >> > > > + int ret, orig_mtu;
> >> > > > +
> >> > > > + ASSERT_RTNL();
> >> > > > +
> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > + &bypass_ops);
> >> > > For master, could you use word "master" in the variables so it is clear?
> >> > > Also, "dev" is fine instead of "netdev".
> >> > > Something like "bpmaster_dev"
> >> > bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
> >> I was trying to point out, that "bypass_netdev" represents a "master"
> >> netdev, yet it does not say master. That is why I suggested
> >> "bpmaster_dev"
> >>
> >>
> >> > I can change all _netdev suffixes to _dev to make the names shorter.
> >> ok.
> >>
> >>
> >> >
> >> > >
> >> > > > + if (!bypass_netdev)
> >> > > > + goto done;
> >> > > > +
> >> > > > + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
> >> > > > + bypass_ops);
> >> > > > + if (ret != 0)
> >> > > Just "if (ret)" will do. You have this on more places.
> >> > OK.
> >> >
> >> >
> >> > >
> >> > > > + goto done;
> >> > > > +
> >> > > > + ret = netdev_rx_handler_register(slave_netdev,
> >> > > > + bypass_ops ? bypass_ops->handle_frame :
> >> > > > + bypass_handle_frame, bypass_netdev);
> >> > > > + if (ret != 0) {
> >> > > > + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
> >> > > > + ret);
> >> > > > + goto done;
> >> > > > + }
> >> > > > +
> >> > > > + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
> >> > > > + if (ret != 0) {
> >> > > > + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
> >> > > > + bypass_netdev->name, ret);
> >> > > > + goto upper_link_failed;
> >> > > > + }
> >> > > > +
> >> > > > + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > + if (netif_running(bypass_netdev)) {
> >> > > > + ret = dev_open(slave_netdev);
> >> > > > + if (ret && (ret != -EBUSY)) {
> >> > > > + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
> >> > > > + slave_netdev->name, ret);
> >> > > > + goto err_interface_up;
> >> > > > + }
> >> > > > + }
> >> > > > +
> >> > > > + /* Align MTU of slave with master */
> >> > > > + orig_mtu = slave_netdev->mtu;
> >> > > > + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
> >> > > > + if (ret != 0) {
> >> > > > + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
> >> > > > + slave_netdev->name, bypass_netdev->mtu);
> >> > > > + goto err_set_mtu;
> >> > > > + }
> >> > > > +
> >> > > > + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > + if (ret != 0)
> >> > > > + goto err_join;
> >> > > > +
> >> > > > + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
> >> > > > +
> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
> >> > > > + slave_netdev->name);
> >> > > > +
> >> > > > + goto done;
> >> > > > +
> >> > > > +err_join:
> >> > > > + dev_set_mtu(slave_netdev, orig_mtu);
> >> > > > +err_set_mtu:
> >> > > > + dev_close(slave_netdev);
> >> > > > +err_interface_up:
> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +upper_link_failed:
> >> > > > + netdev_rx_handler_unregister(slave_netdev);
> >> > > > +done:
> >> > > > + return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev,
> >> > > > + struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > + struct net_device *backup_netdev, *active_netdev;
> >> > > > + struct bypass_info *bi;
> >> > > > +
> >> > > > + if (bypass_ops) {
> >> > > > + if (!bypass_ops->slave_pre_unregister)
> >> > > > + return -EINVAL;
> >> > > > +
> >> > > > + return bypass_ops->slave_pre_unregister(slave_netdev,
> >> > > > + bypass_netdev);
> >> > > > + }
> >> > > > +
> >> > > > + bi = netdev_priv(bypass_netdev);
> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > + return -EINVAL;
> >> > > > +
> >> > > > + return 0;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
> >> > > > + struct net_device *bypass_netdev,
> >> > > > + struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > + struct net_device *backup_netdev, *active_netdev;
> >> > > > + struct bypass_info *bi;
> >> > > > +
> >> > > > + if (bypass_ops) {
> >> > > > + if (!bypass_ops->slave_release)
> >> > > > + return -EINVAL;
> >> > > I think it would be good to make the API to the driver more strict and
> >> > > have a separate set of ops for "active" and "backup" netdevices.
> >> > > That should stop people thinking about extending this to more slaves in
> >> > > the future.
> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
> >> > 'active' slave.
> >> I'm very well aware of that. I just thought that explicit ops for the
> >> two slaves would make this more clear.
> >>
> >>
> >> >
> >> > >
> >> > >
> >> > > > +
> >> > > > + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
> >> > > > + }
> >> > > > +
> >> > > > + bi = netdev_priv(bypass_netdev);
> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > + if (slave_netdev == backup_netdev) {
> >> > > > + RCU_INIT_POINTER(bi->backup_netdev, NULL);
> >> > > > + } else {
> >> > > > + RCU_INIT_POINTER(bi->active_netdev, NULL);
> >> > > > + if (backup_netdev) {
> >> > > > + bypass_netdev->min_mtu = backup_netdev->min_mtu;
> >> > > > + bypass_netdev->max_mtu = backup_netdev->max_mtu;
> >> > > > + }
> >> > > > + }
> >> > > > +
> >> > > > + dev_put(slave_netdev);
> >> > > > +
> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s released\n",
> >> > > > + slave_netdev->name);
> >> > > > +
> >> > > > + return 0;
> >> > > > +}
> >> > > > +
> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > + struct net_device *bypass_netdev;
> >> > > > + struct bypass_ops *bypass_ops;
> >> > > > + int ret;
> >> > > > +
> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
> >> > > > + goto done;
> >> > > > +
> >> > > > + ASSERT_RTNL();
> >> > > > +
> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > + &bypass_ops);
> >> > > > + if (!bypass_netdev)
> >> > > > + goto done;
> >> > > > +
> >> > > > + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
> >> > > > + bypass_ops);
> >> > > > + if (ret != 0)
> >> > > > + goto done;
> >> > > > +
> >> > > > + netdev_rx_handler_unregister(slave_netdev);
> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +
> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
> >> > > > + slave_netdev->name);
> >> > > > +
> >> > > > +done:
> >> > > > + return NOTIFY_DONE;
> >> > > > +}
> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
> >> > > > +
> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
> >> > > > +{
> >> > > > + return netif_running(dev) && netif_carrier_ok(dev);
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
> >> > > > + struct bypass_ops *bypass_ops;
> >> > > > + struct bypass_info *bi;
> >> > > > +
> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
> >> > > > + goto done;
> >> > > > +
> >> > > > + ASSERT_RTNL();
> >> > > > +
> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > + &bypass_ops);
> >> > > > + if (!bypass_netdev)
> >> > > > + goto done;
> >> > > > +
> >> > > > + if (bypass_ops) {
> >> > > > + if (!bypass_ops->slave_link_change)
> >> > > > + goto done;
> >> > > > +
> >> > > > + return bypass_ops->slave_link_change(slave_netdev,
> >> > > > + bypass_netdev);
> >> > > > + }
> >> > > > +
> >> > > > + if (!netif_running(bypass_netdev))
> >> > > > + return 0;
> >> > > > +
> >> > > > + bi = netdev_priv(bypass_netdev);
> >> > > > +
> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > + goto done;
> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> >> > > above is enough.
> >> > I think we need this check to not allow events from a slave that is not
> >> > attached to this master but has the same MAC.
> >> Why do we need such events? Seems wrong to me.
> >
> >We want to avoid events from a netdev that is mis-configured with the same MAC as
> >a bypass setup.
> >
> >> Consider:
> >>
> >> bp1 bp2
> >> a1 b1 a2 b2
> >>
> >>
> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
> >
> >We should not have 2 bypass configs with the same MAC.
> >I need to add a check in the bypass_master_register() to prevent this.
>
> Mac can change, you would have to check in change as well. Feels odd
> thought.
>
>
> >
> >The above check is to avoid cases where we have
> >bp1(a1, b1) with mac1
> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> >
> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> >> the order of creation.
> >> Let's say it will return bp1. Then when we have event for a2, the
> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
> >>
> >>
> >> You cannot use bypass_master_get_bymac() here.
> >>
> >>
> >>
> >> > >
> >> > > > +
> >> > > > + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
> >> > > > + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
> >> > > > + netif_carrier_on(bypass_netdev);
> >> > > > + netif_tx_wake_all_queues(bypass_netdev);
> >> > > > + } else {
> >> > > > + netif_carrier_off(bypass_netdev);
> >> > > > + netif_tx_stop_all_queues(bypass_netdev);
> >> > > > + }
> >> > > > +
> >> > > > +done:
> >> > > > + return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
> >> > > > +{
> >> > > > + /* Skip parent events */
> >> > > > + if (netif_is_bypass_master(dev))
> >> > > > + return false;
> >> > > > +
> >> > > > + /* Avoid non-Ethernet type devices */
> >> > > > + if (dev->type != ARPHRD_ETHER)
> >> > > > + return false;
> >> > > > +
> >> > > > + /* Avoid Vlan dev with same MAC registering as VF */
> >> > > > + if (is_vlan_dev(dev))
> >> > > > + return false;
> >> > > > +
> >> > > > + /* Avoid Bonding master dev with same MAC registering as slave dev */
> >> > > > + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
> >> > > helpers netif_is_bond_master().
> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
> >> > >
> >> > > You need to do it not by blacklisting, but with whitelisting. You need
> >> > > to whitelist VF devices. My port flavours patchset might help with this.
> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
> >> I don't see such function in the code.
> >
> >It is netdev_has_any_lower_dev(). I need to export it.
>
> Come on, you cannot use that. That would allow bonding without slaves,
> but the slaves could be added later on.
>
> What exactly you are trying to achieve by this?
>
>
> >
> >>
> >>
> >> > device is not an upper dev.
> >> > Can you point to your port flavours patchset? Is it upstream?
> >> I sent rfc couple of weeks ago:
> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
> >
> >
> >
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180418222309-mutt-send-email-mst@kernel.org>
Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > >
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > > bypass_master_register()
>> >> > > > bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > > bypass_master_create()
>> >> > > > bypass_master_destroy()
>> >> > > >
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h | 14 +
>> >> > > > include/net/bypass.h | 96 ++++++
>> >> > > > net/Kconfig | 18 +
>> >> > > > net/core/Makefile | 1 +
>> >> > > > net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > >
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > IFF_PHONY_HEADROOM = 1<<24,
>> >> > > > IFF_MACSEC = 1<<25,
>> >> > > > IFF_NO_RX_HANDLER = 1<<26,
>> >> > > > + IFF_BYPASS = 1 << 27,
>> >> > > > + IFF_BYPASS_SLAVE = 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >>
>> >>
>> >> > >
>> >> > > > };
>> >> > > >
>> >> > > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>> >> > > >
>> >> > > > /**
>> >> > > > * struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > >
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > + return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > + return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > + int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > + struct list_head list;
>> >> > > > + struct net_device __rcu *bypass_netdev;
>> >> > > > + struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > + /* passthru netdev with same MAC */
>> >> > > > + struct net_device __rcu *active_netdev;
>> >> > > You still use "active"/"backup" names which is highly misleading as
>> >> > > it has completely different meaning that in bond for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active' name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >>
>> >>
>> >>
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >>
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>>
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby". Active I don't like either. "main"?
Sounds much better, yes.
>
>In fact would failover be better than bypass?
Also, much better.
>
>
>>
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> + struct bypass_ops **ops)
>> >> +{
>> >> + struct bypass_master *bypass_master;
>> >> + struct net_device *bypass_netdev;
>> >> +
>> >> + spin_lock(&bypass_lock);
>> >> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > > for_each_net(net) {
>> >> > > for_each_netdev(net, dev) {
>> >> > > if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>>
>> Howcome? You have no master? I don't understand..
>>
>>
>>
>> >
>> >>
>> >>
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >>
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > >
>> >> > > > + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > + *ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > > existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >>
>> >>
>> >> > >
>> >> > > > + spin_unlock(&bypass_lock);
>> >> > > > + return bypass_netdev;
>> >> > > > + }
>> >> > > > + }
>> >> > > > + spin_unlock(&bypass_lock);
>> >> > > > + return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + int ret, orig_mtu;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >>
>> >>
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >>
>> >>
>> >> >
>> >> > >
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > + bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> >
>> >> >
>> >> > >
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > + bypass_ops ? bypass_ops->handle_frame :
>> >> > > > + bypass_handle_frame, bypass_netdev);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > + ret);
>> >> > > > + goto done;
>> >> > > > + }
>> >> > > > +
>> >> > > > + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > + bypass_netdev->name, ret);
>> >> > > > + goto upper_link_failed;
>> >> > > > + }
>> >> > > > +
>> >> > > > + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > + if (netif_running(bypass_netdev)) {
>> >> > > > + ret = dev_open(slave_netdev);
>> >> > > > + if (ret && (ret != -EBUSY)) {
>> >> > > > + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > + slave_netdev->name, ret);
>> >> > > > + goto err_interface_up;
>> >> > > > + }
>> >> > > > + }
>> >> > > > +
>> >> > > > + /* Align MTU of slave with master */
>> >> > > > + orig_mtu = slave_netdev->mtu;
>> >> > > > + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > + slave_netdev->name, bypass_netdev->mtu);
>> >> > > > + goto err_set_mtu;
>> >> > > > + }
>> >> > > > +
>> >> > > > + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > > + goto err_join;
>> >> > > > +
>> >> > > > + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > + dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > + dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > + netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev,
>> >> > > > + struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > + struct net_device *backup_netdev, *active_netdev;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_pre_unregister)
>> >> > > > + return -EINVAL;
>> >> > > > +
>> >> > > > + return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > + bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > + return -EINVAL;
>> >> > > > +
>> >> > > > + return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev,
>> >> > > > + struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > + struct net_device *backup_netdev, *active_netdev;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_release)
>> >> > > > + return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >>
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > > +
>> >> > > > + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev == backup_netdev) {
>> >> > > > + RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > + } else {
>> >> > > > + RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > + if (backup_netdev) {
>> >> > > > + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > + }
>> >> > > > + }
>> >> > > > +
>> >> > > > + dev_put(slave_netdev);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > + return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + int ret;
>> >> > > > +
>> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > + bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + netdev_rx_handler_unregister(slave_netdev);
>> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > + return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_link_change)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > + bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + if (!netif_running(bypass_netdev))
>> >> > > > + return 0;
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > + goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >> Consider:
>> >>
>> >> bp1 bp2
>> >> a1 b1 a2 b2
>> >>
>> >>
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>>
>> Mac can change, you would have to check in change as well. Feels odd
>> thought.
>>
>>
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >>
>> >>
>> >> You cannot use bypass_master_get_bymac() here.
>> >>
>> >>
>> >>
>> >> > >
>> >> > > > +
>> >> > > > + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > + netif_carrier_on(bypass_netdev);
>> >> > > > + netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > + } else {
>> >> > > > + netif_carrier_off(bypass_netdev);
>> >> > > > + netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > + /* Skip parent events */
>> >> > > > + if (netif_is_bypass_master(dev))
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid non-Ethernet type devices */
>> >> > > > + if (dev->type != ARPHRD_ETHER)
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > + if (is_vlan_dev(dev))
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > >
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>>
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>>
>> What exactly you are trying to achieve by this?
>>
>>
>> >
>> >>
>> >>
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-18 22:46 UTC (permalink / raw)
To: Jiri Pirko, Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, netdev, virtualization,
loseweigh, davem
In-Reply-To: <20180418203206.GC1922@nanopsycho>
On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>>>>>>> You still use "active"/"backup" names which is highly misleading as
>>>>>>> it has completely different meaning that in bond for example.
>>>>>>> I noted that in my previous review already. Please change it.
>>>>>> I guess the issue is with only the 'active' name. 'backup' should be fine as it also
>>>>>> matches with the BACKUP feature bit we are adding to virtio_net.
>>>>> I think that "backup" is also misleading. Both "active" and "backup"
>>>>> mean a *state* of slaves. This should be named differently.
>>>>>
>>>>>
>>>>>
>>>>>> With regards to alternate names for 'active', you suggested 'stolen', but i
>>>>>> am not too happy with it.
>>>>>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>>>>> No. The netdev could be any netdevice. It does not have to be a "VF".
>>>>> I think "stolen" is quite appropriate since it describes the modus
>>>>> operandi. The bypass master steals some netdevice according to some
>>>>> match.
>>>>>
>>>>> But I don't insist on "stolen". Just sounds right.
>>>> We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>>>> 'backup' name is consistent.
>>> It perhaps makes sense from the view of virtio device. However, as I
>>> described couple of times, for master/slave device the name "backup" is
>>> highly misleading.
>> virtio is the backup. You are supposed to use another
>> (typically passthrough) device, if that fails use virtio.
>> It does seem appropriate to me. If you like, we can
>> change that to "standby". Active I don't like either. "main"?
> Sounds much better, yes.
OK. Will change backup to 'standby'.
'main' is fine, what about 'primary'?
>
>
>> In fact would failover be better than bypass?
> Also, much better.
So do we want to change all 'bypass' references to 'failover' including
the filenames.(net/core/failover.c and include/net/failover.h)
<snip>
>
>
>>
>>>> The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>>>> a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>>>>
>>>> Will look for any suggestions in the next day or two. If i don't get any, i will go
>>>> with 'stolen'
>>>>
>>>> <snip>
>>>>
>>>>
>>>>> +
>>>>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>>>>> + struct bypass_ops **ops)
>>>>> +{
>>>>> + struct bypass_master *bypass_master;
>>>>> + struct net_device *bypass_netdev;
>>>>> +
>>>>> + spin_lock(&bypass_lock);
>>>>> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>>>>>> As I wrote the last time, you don't need this list, spinlock.
>>>>>>> You can do just something like:
>>>>>>> for_each_net(net) {
>>>>>>> for_each_netdev(net, dev) {
>>>>>>> if (netif_is_bypass_master(dev)) {
>>>>>> This function returns the upper netdev as well as the ops associated
>>>>>> with that netdev.
>>>>>> bypass_master_list is a list of 'struct bypass_master' that associates
>>>>> Well, can't you have it in netdev priv?
>>>> We cannot do this for 2-netdev model as there is no bypass_netdev created.
>>> Howcome? You have no master? I don't understand..
For 2-netdev model, the master netdev is not a new one created by the bypass module.
It is created by netvsc internally and passed via bypass_master_register()
<snip>
>>>
>>>>>>>> +
>>>>>>>> + /* Avoid Bonding master dev with same MAC registering as slave dev */
>>>>>>>> + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>>>>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>>>>>> helpers netif_is_bond_master().
>>>>>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>>>>>
>>>>>>> You need to do it not by blacklisting, but with whitelisting. You need
>>>>>>> to whitelist VF devices. My port flavours patchset might help with this.
>>>>>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
>>>>> I don't see such function in the code.
>>>> It is netdev_has_any_lower_dev(). I need to export it.
>>> Come on, you cannot use that. That would allow bonding without slaves,
>>> but the slaves could be added later on.
>>>
>>> What exactly you are trying to achieve by this?
I think i can remove this check. In pre-register,
for backup device, i check that its parent matches bypass device &
for vf device, we make sure that it is a pci device.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Samudrala, Sridhar @ 2018-04-18 23:33 UTC (permalink / raw)
To: Siwei Liu, David Miller
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Netdev, virtualization, David Ahern, si-wei liu
In-Reply-To: <CADGSJ21nQ1BQctfZcThpmOcOUrpqLfG71jsh2trb1utVjNQH=Q@mail.gmail.com>
On 4/17/2018 5:26 PM, Siwei Liu wrote:
> I ran this with a few folks offline and gathered some good feedbacks
> that I'd like to share thus revive the discussion.
>
> First of all, as illustrated in the reply below, cloud service
> providers require transparent live migration. Specifically, the main
> target of our case is to support SR-IOV live migration via kernel
> upgrade while keeping the userspace of old distros unmodified. If it's
> because this use case is not appealing enough for the mainline to
> adopt, I will shut up and not continue discussing, although
> technically it's entirely possible (and there's precedent in other
> implementation) to do so to benefit any cloud service providers.
>
> If it's just the implementation of hiding netdev itself needs to be
> improved, such as implementing it as attribute flag or adding linkdump
> API, that's completely fine and we can look into that. However, the
> specific issue needs to be undestood beforehand is to make transparent
> SR-IOV to be able to take over the name (so inherit all the configs)
> from the lower netdev, which needs some games with uevents and name
> space reservation. So far I don't think it's been well discussed.
>
> One thing in particular I'd like to point out is that the 3-netdev
> model currently missed to address the core problem of live migration:
> migration of hardware specific feature/state, for e.g. ethtool configs
> and hardware offloading states. Only general network state (IP
> address, gateway, for eg.) associated with the bypass interface can be
> migrated. As a follow-up work, bypass driver can/should be enhanced to
> save and apply those hardware specific configs before or after
> migration as needed. The transparent 1-netdev model being proposed as
> part of this patch series will be able to solve that problem naturally
> by making all hardware specific configurations go through the central
> bypass driver, such that hardware configurations can be replayed when
> new VF or passthrough gets plugged back in. Although that
> corresponding function hasn't been implemented today, I'd like to
> refresh everyone's mind that is the core problem any live migration
> proposal should have addressed.
>
> If it would make things more clear to defer netdev hiding until all
> functionalities regarding centralizing and replay are implemented,
> we'd take advices like that and move on to implementing those features
> as follow-up patches. Once all needed features get done, we'd resume
> the work for hiding lower netdev at that point. Think it would be the
> best to make everyone understand the big picture in advance before
> going too far.
I think we should get the 3-netdev model integrated and add any additional
ndo_ops/ethool ops that we would like to support/migrate before looking into
hiding the lower netdevs.
>
> Thanks, comments welcome.
>
> -Siwei
>
>
> On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Siwei Liu <loseweigh@gmail.com>
>>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>>
>>>> And I assume everyone here understands the use case for live
>>>> migration (in the context of providing cloud service) is very
>>>> different, and we have to hide the netdevs. If not, I'm more than
>>>> happy to clarify.
>>> I think you still need to clarify.
>> OK. The short answer is cloud users really want *transparent* live migration.
>>
>> By being transparent it means they don't and shouldn't care about the
>> existence and the occurence of live migration, but they do if
>> userspace toolstack and libraries have to be updated or modified,
>> which means potential dependency brokeness of their applications. They
>> don't like any change to the userspace envinroment (existing apps
>> lift-and-shift, no recompilation, no re-packaging, no re-certification
>> needed), while no one barely cares about ABI or API compatibility in
>> the kernel level, as long as their applications don't break.
>>
>> I agree the current bypass solution for SR-IOV live migration requires
>> guest cooperation. Though it doesn't mean guest *userspace*
>> cooperation. As a matter of fact, techinically it shouldn't invovle
>> userspace at all to get SR-IOV migration working. It's the kernel that
>> does the real work. If I understand the goal of this in-kernel
>> approach correctly, it was meant to save userspace from modification
>> or corresponding toolstack support, as those additional 2 interfaces
>> is more a side product of this approach, rather than being neccessary
>> for users to be aware of. All what the user needs to deal with is one
>> single interface, and that's what they care about. It's more a trouble
>> than help when they see 2 extra interfaces are present. Management
>> tools in the old distros don't recoginze them and try to bring up
>> those extra interfaces for its own. Various odd warnings start to spew
>> out, and there's a lot of caveats for the users to get around...
>>
>> On the other hand, if we "teach" those cloud users to update the
>> userspace toolstack just for trading a feature they don't need, no one
>> is likely going to embrace the change. As such there's just no real
>> value of adopting this in-kernel bypass facility for any cloud service
>> provider. It does not look more appealing than just configure generic
>> bonding using its own set of daemons or scripts. But again, cloud
>> users don't welcome that facility. And basically it would get to
>> nearly the same set of problems if leaving userspace alone.
>>
>> IMHO we're not hiding the devices, think it the way we're adding a
>> feature transparent to user. Those auto-managed slaves are ones users
>> don't care about much. And user is still able to see and configure the
>> lower netdevs if they really desires to do so. But generally the
>> target user for this feature won't need to know that. Why they care
>> how many interfaces a VM virtually has rather than how many interfaces
>> are actually _useable_ to them??
>>
>> Thanks,
>> -Siwei
>>
>>
>>> netdevs are netdevs. If they have special attributes, mark them as
>>> such and the tools base their actions upon that.
>>>
>>> "Hiding", or changing classes, doesn't make any sense to me still.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Michael S. Tsirkin @ 2018-04-19 4:08 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180418203206.GC1922@nanopsycho>
On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> >> > am not too happy with it.
> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> >> I think "stolen" is quite appropriate since it describes the modus
> >> >> operandi. The bypass master steals some netdevice according to some
> >> >> match.
> >> >>
> >> >> But I don't insist on "stolen". Just sounds right.
> >> >
> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >> >'backup' name is consistent.
> >>
> >> It perhaps makes sense from the view of virtio device. However, as I
> >> described couple of times, for master/slave device the name "backup" is
> >> highly misleading.
> >
> >virtio is the backup. You are supposed to use another
> >(typically passthrough) device, if that fails use virtio.
> >It does seem appropriate to me. If you like, we can
> >change that to "standby". Active I don't like either. "main"?
>
> Sounds much better, yes.
Excuse me, which of the versions are better in your eyes?
>
> >
> >In fact would failover be better than bypass?
>
> Also, much better.
>
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Michael S. Tsirkin @ 2018-04-19 4:41 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
virtualization, Siwei Liu, David Ahern, si-wei liu, David Miller
In-Reply-To: <1f3af59f-fd64-cc0d-f9eb-668636c52db4@intel.com>
On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > I ran this with a few folks offline and gathered some good feedbacks
> > that I'd like to share thus revive the discussion.
> >
> > First of all, as illustrated in the reply below, cloud service
> > providers require transparent live migration. Specifically, the main
> > target of our case is to support SR-IOV live migration via kernel
> > upgrade while keeping the userspace of old distros unmodified. If it's
> > because this use case is not appealing enough for the mainline to
> > adopt, I will shut up and not continue discussing, although
> > technically it's entirely possible (and there's precedent in other
> > implementation) to do so to benefit any cloud service providers.
> >
> > If it's just the implementation of hiding netdev itself needs to be
> > improved, such as implementing it as attribute flag or adding linkdump
> > API, that's completely fine and we can look into that. However, the
> > specific issue needs to be undestood beforehand is to make transparent
> > SR-IOV to be able to take over the name (so inherit all the configs)
> > from the lower netdev, which needs some games with uevents and name
> > space reservation. So far I don't think it's been well discussed.
> >
> > One thing in particular I'd like to point out is that the 3-netdev
> > model currently missed to address the core problem of live migration:
> > migration of hardware specific feature/state, for e.g. ethtool configs
> > and hardware offloading states. Only general network state (IP
> > address, gateway, for eg.) associated with the bypass interface can be
> > migrated. As a follow-up work, bypass driver can/should be enhanced to
> > save and apply those hardware specific configs before or after
> > migration as needed. The transparent 1-netdev model being proposed as
> > part of this patch series will be able to solve that problem naturally
> > by making all hardware specific configurations go through the central
> > bypass driver, such that hardware configurations can be replayed when
> > new VF or passthrough gets plugged back in. Although that
> > corresponding function hasn't been implemented today, I'd like to
> > refresh everyone's mind that is the core problem any live migration
> > proposal should have addressed.
> >
> > If it would make things more clear to defer netdev hiding until all
> > functionalities regarding centralizing and replay are implemented,
> > we'd take advices like that and move on to implementing those features
> > as follow-up patches. Once all needed features get done, we'd resume
> > the work for hiding lower netdev at that point. Think it would be the
> > best to make everyone understand the big picture in advance before
> > going too far.
>
> I think we should get the 3-netdev model integrated and add any additional
> ndo_ops/ethool ops that we would like to support/migrate before looking into
> hiding the lower netdevs.
Once they are exposed, I don't think we'll be able to hide them -
they will be a kernel ABI.
Do you think everyone needs to hide the SRIOV device?
Or that only some users need this?
--
MST
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Samudrala, Sridhar @ 2018-04-19 5:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
virtualization, Siwei Liu, David Ahern, si-wei liu, David Miller
In-Reply-To: <20180419072003-mutt-send-email-mst@kernel.org>
On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>> I ran this with a few folks offline and gathered some good feedbacks
>>> that I'd like to share thus revive the discussion.
>>>
>>> First of all, as illustrated in the reply below, cloud service
>>> providers require transparent live migration. Specifically, the main
>>> target of our case is to support SR-IOV live migration via kernel
>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>> because this use case is not appealing enough for the mainline to
>>> adopt, I will shut up and not continue discussing, although
>>> technically it's entirely possible (and there's precedent in other
>>> implementation) to do so to benefit any cloud service providers.
>>>
>>> If it's just the implementation of hiding netdev itself needs to be
>>> improved, such as implementing it as attribute flag or adding linkdump
>>> API, that's completely fine and we can look into that. However, the
>>> specific issue needs to be undestood beforehand is to make transparent
>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>> from the lower netdev, which needs some games with uevents and name
>>> space reservation. So far I don't think it's been well discussed.
>>>
>>> One thing in particular I'd like to point out is that the 3-netdev
>>> model currently missed to address the core problem of live migration:
>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>> and hardware offloading states. Only general network state (IP
>>> address, gateway, for eg.) associated with the bypass interface can be
>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>> save and apply those hardware specific configs before or after
>>> migration as needed. The transparent 1-netdev model being proposed as
>>> part of this patch series will be able to solve that problem naturally
>>> by making all hardware specific configurations go through the central
>>> bypass driver, such that hardware configurations can be replayed when
>>> new VF or passthrough gets plugged back in. Although that
>>> corresponding function hasn't been implemented today, I'd like to
>>> refresh everyone's mind that is the core problem any live migration
>>> proposal should have addressed.
>>>
>>> If it would make things more clear to defer netdev hiding until all
>>> functionalities regarding centralizing and replay are implemented,
>>> we'd take advices like that and move on to implementing those features
>>> as follow-up patches. Once all needed features get done, we'd resume
>>> the work for hiding lower netdev at that point. Think it would be the
>>> best to make everyone understand the big picture in advance before
>>> going too far.
>> I think we should get the 3-netdev model integrated and add any additional
>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>> hiding the lower netdevs.
> Once they are exposed, I don't think we'll be able to hide them -
> they will be a kernel ABI.
>
> Do you think everyone needs to hide the SRIOV device?
> Or that only some users need this?
Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
think it is a hard requirement. And also, as we don't yet have a consensus on how to hide
the lower netdevs, we could make it as another feature bit to hide lower netdevs once
we have an acceptable solution.
^ permalink raw reply
* [PATCH net] virtio_net: split out ctrl buffer
From: Michael S. Tsirkin @ 2018-04-19 5:01 UTC (permalink / raw)
To: linux-kernel
Cc: Eric Dumazet, netdev, virtualization, Mikulas Patocka,
David Miller
When sending control commands, virtio net sets up several buffers for
DMA. The buffers are all part of the net device which means it's
actually allocated by kvmalloc so in theory (on extreme memory pressure)
it's possible to get a vmalloc'ed buffer which on some platforms means
we can't DMA there.
Fix up by moving the DMA buffers out into a separate structure.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
Lightly tested.
Mikulas, does this address your problem?
drivers/net/virtio_net.c | 68 +++++++++++++++++++++++++++---------------------
1 file changed, 39 insertions(+), 29 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..82f50e5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -147,6 +147,17 @@ struct receive_queue {
struct xdp_rxq_info xdp_rxq;
};
+/* Control VQ buffers: protected by the rtnl lock */
+struct control_buf {
+ struct virtio_net_ctrl_hdr hdr;
+ virtio_net_ctrl_ack status;
+ struct virtio_net_ctrl_mq mq;
+ u8 promisc;
+ u8 allmulti;
+ u16 vid;
+ u64 offloads;
+};
+
struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
@@ -192,14 +203,7 @@ struct virtnet_info {
struct hlist_node node;
struct hlist_node node_dead;
- /* Control VQ buffers: protected by the rtnl lock */
- struct virtio_net_ctrl_hdr ctrl_hdr;
- virtio_net_ctrl_ack ctrl_status;
- struct virtio_net_ctrl_mq ctrl_mq;
- u8 ctrl_promisc;
- u8 ctrl_allmulti;
- u16 ctrl_vid;
- u64 ctrl_offloads;
+ struct control_buf *ctrl;
/* Ethtool settings */
u8 duplex;
@@ -1454,25 +1458,25 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
- vi->ctrl_status = ~0;
- vi->ctrl_hdr.class = class;
- vi->ctrl_hdr.cmd = cmd;
+ vi->ctrl->status = ~0;
+ vi->ctrl->hdr.class = class;
+ vi->ctrl->hdr.cmd = cmd;
/* Add header */
- sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
+ sg_init_one(&hdr, &vi->ctrl->hdr, sizeof(vi->ctrl->hdr));
sgs[out_num++] = &hdr;
if (out)
sgs[out_num++] = out;
/* Add return status. */
- sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
+ sg_init_one(&stat, &vi->ctrl->status, sizeof(vi->ctrl->status));
sgs[out_num] = &stat;
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
if (unlikely(!virtqueue_kick(vi->cvq)))
- return vi->ctrl_status == VIRTIO_NET_OK;
+ return vi->ctrl->status == VIRTIO_NET_OK;
/* Spin for a response, the kick causes an ioport write, trapping
* into the hypervisor, so the request should be handled immediately.
@@ -1481,7 +1485,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
!virtqueue_is_broken(vi->cvq))
cpu_relax();
- return vi->ctrl_status == VIRTIO_NET_OK;
+ return vi->ctrl->status == VIRTIO_NET_OK;
}
static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1593,8 +1597,8 @@ static int _virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
return 0;
- vi->ctrl_mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
- sg_init_one(&sg, &vi->ctrl_mq, sizeof(vi->ctrl_mq));
+ vi->ctrl->mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
+ sg_init_one(&sg, &vi->ctrl->mq, sizeof(vi->ctrl->mq));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg)) {
@@ -1653,22 +1657,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
- vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
- vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+ vi->ctrl->promisc = ((dev->flags & IFF_PROMISC) != 0);
+ vi->ctrl->allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
- sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
+ sg_init_one(sg, &vi->ctrl->promisc, sizeof(vi->ctrl->promisc));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
- vi->ctrl_promisc ? "en" : "dis");
+ vi->ctrl->promisc ? "en" : "dis");
- sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
+ sg_init_one(sg, &vi->ctrl->allmulti, sizeof(vi->ctrl->allmulti));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
- vi->ctrl_allmulti ? "en" : "dis");
+ vi->ctrl->allmulti ? "en" : "dis");
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
@@ -1714,8 +1718,8 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl_vid = vid;
- sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
+ vi->ctrl->vid = vid;
+ sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
VIRTIO_NET_CTRL_VLAN_ADD, &sg))
@@ -1729,8 +1733,8 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl_vid = vid;
- sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
+ vi->ctrl->vid = vid;
+ sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
VIRTIO_NET_CTRL_VLAN_DEL, &sg))
@@ -2126,9 +2130,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
{
struct scatterlist sg;
- vi->ctrl_offloads = cpu_to_virtio64(vi->vdev, offloads);
+ vi->ctrl->offloads = cpu_to_virtio64(vi->vdev, offloads);
- sg_init_one(&sg, &vi->ctrl_offloads, sizeof(vi->ctrl_offloads));
+ sg_init_one(&sg, &vi->ctrl->offloads, sizeof(vi->ctrl->offloads));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg)) {
@@ -2351,6 +2355,7 @@ static void virtnet_free_queues(struct virtnet_info *vi)
kfree(vi->rq);
kfree(vi->sq);
+ kfree(vi->err_ctrl);
}
static void _free_receive_bufs(struct virtnet_info *vi)
@@ -2543,6 +2548,9 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
{
int i;
+ vi->ctrl = kzalloc(sizeof(*vi->ctrl), GFP_KERNEL);
+ if (!vi->ctrl)
+ goto err_ctrl;
vi->sq = kzalloc(sizeof(*vi->sq) * vi->max_queue_pairs, GFP_KERNEL);
if (!vi->sq)
goto err_sq;
@@ -2571,6 +2579,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
err_rq:
kfree(vi->sq);
err_sq:
+ kfree(vq->ctrl);
+err_ctrl:
return -ENOMEM;
}
--
MST
^ permalink raw reply related
* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Michael S. Tsirkin @ 2018-04-19 5:07 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
virtualization, Siwei Liu, David Ahern, si-wei liu, David Miller
In-Reply-To: <4e029524-d542-0f12-bfdb-7c8a2f198e28@intel.com>
On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> > On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> > > On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > > > I ran this with a few folks offline and gathered some good feedbacks
> > > > that I'd like to share thus revive the discussion.
> > > >
> > > > First of all, as illustrated in the reply below, cloud service
> > > > providers require transparent live migration. Specifically, the main
> > > > target of our case is to support SR-IOV live migration via kernel
> > > > upgrade while keeping the userspace of old distros unmodified. If it's
> > > > because this use case is not appealing enough for the mainline to
> > > > adopt, I will shut up and not continue discussing, although
> > > > technically it's entirely possible (and there's precedent in other
> > > > implementation) to do so to benefit any cloud service providers.
> > > >
> > > > If it's just the implementation of hiding netdev itself needs to be
> > > > improved, such as implementing it as attribute flag or adding linkdump
> > > > API, that's completely fine and we can look into that. However, the
> > > > specific issue needs to be undestood beforehand is to make transparent
> > > > SR-IOV to be able to take over the name (so inherit all the configs)
> > > > from the lower netdev, which needs some games with uevents and name
> > > > space reservation. So far I don't think it's been well discussed.
> > > >
> > > > One thing in particular I'd like to point out is that the 3-netdev
> > > > model currently missed to address the core problem of live migration:
> > > > migration of hardware specific feature/state, for e.g. ethtool configs
> > > > and hardware offloading states. Only general network state (IP
> > > > address, gateway, for eg.) associated with the bypass interface can be
> > > > migrated. As a follow-up work, bypass driver can/should be enhanced to
> > > > save and apply those hardware specific configs before or after
> > > > migration as needed. The transparent 1-netdev model being proposed as
> > > > part of this patch series will be able to solve that problem naturally
> > > > by making all hardware specific configurations go through the central
> > > > bypass driver, such that hardware configurations can be replayed when
> > > > new VF or passthrough gets plugged back in. Although that
> > > > corresponding function hasn't been implemented today, I'd like to
> > > > refresh everyone's mind that is the core problem any live migration
> > > > proposal should have addressed.
> > > >
> > > > If it would make things more clear to defer netdev hiding until all
> > > > functionalities regarding centralizing and replay are implemented,
> > > > we'd take advices like that and move on to implementing those features
> > > > as follow-up patches. Once all needed features get done, we'd resume
> > > > the work for hiding lower netdev at that point. Think it would be the
> > > > best to make everyone understand the big picture in advance before
> > > > going too far.
> > > I think we should get the 3-netdev model integrated and add any additional
> > > ndo_ops/ethool ops that we would like to support/migrate before looking into
> > > hiding the lower netdevs.
> > Once they are exposed, I don't think we'll be able to hide them -
> > they will be a kernel ABI.
> >
> > Do you think everyone needs to hide the SRIOV device?
> > Or that only some users need this?
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
> think it is a hard requirement.
OK, fine.
> And also, as we don't yet have a consensus on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower netdevs once
> we have an acceptable solution.
Guest/host interface isn't more flexible than the userspace/kernel
interface. The feature bit you propose would say what exactly?
Hypervisor has no idea what guest kernel shows guest userspace.
Note that the backup flag doesn't tell guest kernel what to do,
it just tells guest that there is or will be a faster main device
connected to the same backend, so the backup should only be used
when main device is not present.
--
MST
^ permalink raw reply
* Re: [PATCH net] virtio_net: split out ctrl buffer
From: Michael S. Tsirkin @ 2018-04-19 5:18 UTC (permalink / raw)
To: linux-kernel
Cc: Eric Dumazet, netdev, virtualization, Mikulas Patocka,
David Miller
In-Reply-To: <1524113437-308621-1-git-send-email-mst@redhat.com>
On Thu, Apr 19, 2018 at 08:01:49AM +0300, Michael S. Tsirkin wrote:
> When sending control commands, virtio net sets up several buffers for
> DMA. The buffers are all part of the net device which means it's
> actually allocated by kvmalloc so in theory (on extreme memory pressure)
> it's possible to get a vmalloc'ed buffer which on some platforms means
> we can't DMA there.
>
> Fix up by moving the DMA buffers out into a separate structure.
>
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Ouch forgot to commit a fix. Pls ignore will send v2 now.
> ---
>
> Lightly tested.
> Mikulas, does this address your problem?
>
> drivers/net/virtio_net.c | 68 +++++++++++++++++++++++++++---------------------
> 1 file changed, 39 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..82f50e5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -147,6 +147,17 @@ struct receive_queue {
> struct xdp_rxq_info xdp_rxq;
> };
>
> +/* Control VQ buffers: protected by the rtnl lock */
> +struct control_buf {
> + struct virtio_net_ctrl_hdr hdr;
> + virtio_net_ctrl_ack status;
> + struct virtio_net_ctrl_mq mq;
> + u8 promisc;
> + u8 allmulti;
> + u16 vid;
> + u64 offloads;
> +};
> +
> struct virtnet_info {
> struct virtio_device *vdev;
> struct virtqueue *cvq;
> @@ -192,14 +203,7 @@ struct virtnet_info {
> struct hlist_node node;
> struct hlist_node node_dead;
>
> - /* Control VQ buffers: protected by the rtnl lock */
> - struct virtio_net_ctrl_hdr ctrl_hdr;
> - virtio_net_ctrl_ack ctrl_status;
> - struct virtio_net_ctrl_mq ctrl_mq;
> - u8 ctrl_promisc;
> - u8 ctrl_allmulti;
> - u16 ctrl_vid;
> - u64 ctrl_offloads;
> + struct control_buf *ctrl;
>
> /* Ethtool settings */
> u8 duplex;
> @@ -1454,25 +1458,25 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> /* Caller should know better */
> BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>
> - vi->ctrl_status = ~0;
> - vi->ctrl_hdr.class = class;
> - vi->ctrl_hdr.cmd = cmd;
> + vi->ctrl->status = ~0;
> + vi->ctrl->hdr.class = class;
> + vi->ctrl->hdr.cmd = cmd;
> /* Add header */
> - sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
> + sg_init_one(&hdr, &vi->ctrl->hdr, sizeof(vi->ctrl->hdr));
> sgs[out_num++] = &hdr;
>
> if (out)
> sgs[out_num++] = out;
>
> /* Add return status. */
> - sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
> + sg_init_one(&stat, &vi->ctrl->status, sizeof(vi->ctrl->status));
> sgs[out_num] = &stat;
>
> BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
>
> if (unlikely(!virtqueue_kick(vi->cvq)))
> - return vi->ctrl_status == VIRTIO_NET_OK;
> + return vi->ctrl->status == VIRTIO_NET_OK;
>
> /* Spin for a response, the kick causes an ioport write, trapping
> * into the hypervisor, so the request should be handled immediately.
> @@ -1481,7 +1485,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> !virtqueue_is_broken(vi->cvq))
> cpu_relax();
>
> - return vi->ctrl_status == VIRTIO_NET_OK;
> + return vi->ctrl->status == VIRTIO_NET_OK;
> }
>
> static int virtnet_set_mac_address(struct net_device *dev, void *p)
> @@ -1593,8 +1597,8 @@ static int _virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
> if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
> return 0;
>
> - vi->ctrl_mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
> - sg_init_one(&sg, &vi->ctrl_mq, sizeof(vi->ctrl_mq));
> + vi->ctrl->mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
> + sg_init_one(&sg, &vi->ctrl->mq, sizeof(vi->ctrl->mq));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
> VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg)) {
> @@ -1653,22 +1657,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
> if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
> return;
>
> - vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
> - vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
> + vi->ctrl->promisc = ((dev->flags & IFF_PROMISC) != 0);
> + vi->ctrl->allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
>
> - sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
> + sg_init_one(sg, &vi->ctrl->promisc, sizeof(vi->ctrl->promisc));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_PROMISC, sg))
> dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
> - vi->ctrl_promisc ? "en" : "dis");
> + vi->ctrl->promisc ? "en" : "dis");
>
> - sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
> + sg_init_one(sg, &vi->ctrl->allmulti, sizeof(vi->ctrl->allmulti));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
> dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
> - vi->ctrl_allmulti ? "en" : "dis");
> + vi->ctrl->allmulti ? "en" : "dis");
>
> uc_count = netdev_uc_count(dev);
> mc_count = netdev_mc_count(dev);
> @@ -1714,8 +1718,8 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev,
> struct virtnet_info *vi = netdev_priv(dev);
> struct scatterlist sg;
>
> - vi->ctrl_vid = vid;
> - sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
> + vi->ctrl->vid = vid;
> + sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
> VIRTIO_NET_CTRL_VLAN_ADD, &sg))
> @@ -1729,8 +1733,8 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev,
> struct virtnet_info *vi = netdev_priv(dev);
> struct scatterlist sg;
>
> - vi->ctrl_vid = vid;
> - sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
> + vi->ctrl->vid = vid;
> + sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
> VIRTIO_NET_CTRL_VLAN_DEL, &sg))
> @@ -2126,9 +2130,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
> static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
> {
> struct scatterlist sg;
> - vi->ctrl_offloads = cpu_to_virtio64(vi->vdev, offloads);
> + vi->ctrl->offloads = cpu_to_virtio64(vi->vdev, offloads);
>
> - sg_init_one(&sg, &vi->ctrl_offloads, sizeof(vi->ctrl_offloads));
> + sg_init_one(&sg, &vi->ctrl->offloads, sizeof(vi->ctrl->offloads));
>
> if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg)) {
> @@ -2351,6 +2355,7 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>
> kfree(vi->rq);
> kfree(vi->sq);
> + kfree(vi->err_ctrl);
> }
>
> static void _free_receive_bufs(struct virtnet_info *vi)
> @@ -2543,6 +2548,9 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> {
> int i;
>
> + vi->ctrl = kzalloc(sizeof(*vi->ctrl), GFP_KERNEL);
> + if (!vi->ctrl)
> + goto err_ctrl;
> vi->sq = kzalloc(sizeof(*vi->sq) * vi->max_queue_pairs, GFP_KERNEL);
> if (!vi->sq)
> goto err_sq;
> @@ -2571,6 +2579,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> err_rq:
> kfree(vi->sq);
> err_sq:
> + kfree(vq->ctrl);
> +err_ctrl:
> return -ENOMEM;
> }
>
> --
> MST
^ permalink raw reply
* [PATCH v2 net 1/3] virtio_net: split out ctrl buffer
From: Michael S. Tsirkin @ 2018-04-19 5:30 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Huth, Eric Dumazet, netdev, Cornelia Huck, virtualization,
Mikulas Patocka, David Miller
In-Reply-To: <1524115776-334953-1-git-send-email-mst@redhat.com>
When sending control commands, virtio net sets up several buffers for
DMA. The buffers are all part of the net device which means it's
actually allocated by kvmalloc so it's in theory (on extreme memory
pressure) possible to get a vmalloc'ed buffer which on some platforms
means we can't DMA there.
Fix up by moving the DMA buffers into a separate structure.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes from v1:
build fix
drivers/net/virtio_net.c | 68 +++++++++++++++++++++++++++---------------------
1 file changed, 39 insertions(+), 29 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..3d0eff53 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -147,6 +147,17 @@ struct receive_queue {
struct xdp_rxq_info xdp_rxq;
};
+/* Control VQ buffers: protected by the rtnl lock */
+struct control_buf {
+ struct virtio_net_ctrl_hdr hdr;
+ virtio_net_ctrl_ack status;
+ struct virtio_net_ctrl_mq mq;
+ u8 promisc;
+ u8 allmulti;
+ u16 vid;
+ u64 offloads;
+};
+
struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
@@ -192,14 +203,7 @@ struct virtnet_info {
struct hlist_node node;
struct hlist_node node_dead;
- /* Control VQ buffers: protected by the rtnl lock */
- struct virtio_net_ctrl_hdr ctrl_hdr;
- virtio_net_ctrl_ack ctrl_status;
- struct virtio_net_ctrl_mq ctrl_mq;
- u8 ctrl_promisc;
- u8 ctrl_allmulti;
- u16 ctrl_vid;
- u64 ctrl_offloads;
+ struct control_buf *ctrl;
/* Ethtool settings */
u8 duplex;
@@ -1454,25 +1458,25 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
- vi->ctrl_status = ~0;
- vi->ctrl_hdr.class = class;
- vi->ctrl_hdr.cmd = cmd;
+ vi->ctrl->status = ~0;
+ vi->ctrl->hdr.class = class;
+ vi->ctrl->hdr.cmd = cmd;
/* Add header */
- sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
+ sg_init_one(&hdr, &vi->ctrl->hdr, sizeof(vi->ctrl->hdr));
sgs[out_num++] = &hdr;
if (out)
sgs[out_num++] = out;
/* Add return status. */
- sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
+ sg_init_one(&stat, &vi->ctrl->status, sizeof(vi->ctrl->status));
sgs[out_num] = &stat;
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
if (unlikely(!virtqueue_kick(vi->cvq)))
- return vi->ctrl_status == VIRTIO_NET_OK;
+ return vi->ctrl->status == VIRTIO_NET_OK;
/* Spin for a response, the kick causes an ioport write, trapping
* into the hypervisor, so the request should be handled immediately.
@@ -1481,7 +1485,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
!virtqueue_is_broken(vi->cvq))
cpu_relax();
- return vi->ctrl_status == VIRTIO_NET_OK;
+ return vi->ctrl->status == VIRTIO_NET_OK;
}
static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1593,8 +1597,8 @@ static int _virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
return 0;
- vi->ctrl_mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
- sg_init_one(&sg, &vi->ctrl_mq, sizeof(vi->ctrl_mq));
+ vi->ctrl->mq.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
+ sg_init_one(&sg, &vi->ctrl->mq, sizeof(vi->ctrl->mq));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg)) {
@@ -1653,22 +1657,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
- vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
- vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+ vi->ctrl->promisc = ((dev->flags & IFF_PROMISC) != 0);
+ vi->ctrl->allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
- sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
+ sg_init_one(sg, &vi->ctrl->promisc, sizeof(vi->ctrl->promisc));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
- vi->ctrl_promisc ? "en" : "dis");
+ vi->ctrl->promisc ? "en" : "dis");
- sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
+ sg_init_one(sg, &vi->ctrl->allmulti, sizeof(vi->ctrl->allmulti));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
- vi->ctrl_allmulti ? "en" : "dis");
+ vi->ctrl->allmulti ? "en" : "dis");
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
@@ -1714,8 +1718,8 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl_vid = vid;
- sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
+ vi->ctrl->vid = vid;
+ sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
VIRTIO_NET_CTRL_VLAN_ADD, &sg))
@@ -1729,8 +1733,8 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl_vid = vid;
- sg_init_one(&sg, &vi->ctrl_vid, sizeof(vi->ctrl_vid));
+ vi->ctrl->vid = vid;
+ sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
VIRTIO_NET_CTRL_VLAN_DEL, &sg))
@@ -2126,9 +2130,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
{
struct scatterlist sg;
- vi->ctrl_offloads = cpu_to_virtio64(vi->vdev, offloads);
+ vi->ctrl->offloads = cpu_to_virtio64(vi->vdev, offloads);
- sg_init_one(&sg, &vi->ctrl_offloads, sizeof(vi->ctrl_offloads));
+ sg_init_one(&sg, &vi->ctrl->offloads, sizeof(vi->ctrl->offloads));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg)) {
@@ -2351,6 +2355,7 @@ static void virtnet_free_queues(struct virtnet_info *vi)
kfree(vi->rq);
kfree(vi->sq);
+ kfree(vi->ctrl);
}
static void _free_receive_bufs(struct virtnet_info *vi)
@@ -2543,6 +2548,9 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
{
int i;
+ vi->ctrl = kzalloc(sizeof(*vi->ctrl), GFP_KERNEL);
+ if (!vi->ctrl)
+ goto err_ctrl;
vi->sq = kzalloc(sizeof(*vi->sq) * vi->max_queue_pairs, GFP_KERNEL);
if (!vi->sq)
goto err_sq;
@@ -2571,6 +2579,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
err_rq:
kfree(vi->sq);
err_sq:
+ kfree(vi->ctrl);
+err_ctrl:
return -ENOMEM;
}
--
MST
^ permalink raw reply related
* [PATCH v2 net 2/3] virtio_net: fix adding vids on big-endian
From: Michael S. Tsirkin @ 2018-04-19 5:30 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Huth, Eric Dumazet, netdev, Cornelia Huck, virtualization,
Mikulas Patocka, David Miller
In-Reply-To: <1524115776-334953-1-git-send-email-mst@redhat.com>
Programming vids (adding or removing them) still passes
guest-endian values in the DMA buffer. That's wrong
if guest is big-endian and when virtio 1 is enabled.
Note: this is on top of a previous patch:
virtio_net: split out ctrl buffer
Fixes: 9465a7a6f ("virtio_net: enable v1.0 support")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/net/virtio_net.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d0eff53..f84fe04 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -154,7 +154,7 @@ struct control_buf {
struct virtio_net_ctrl_mq mq;
u8 promisc;
u8 allmulti;
- u16 vid;
+ __virtio16 vid;
u64 offloads;
};
@@ -1718,7 +1718,7 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl->vid = vid;
+ vi->ctrl->vid = cpu_to_virtio16(vi->vdev, vid);
sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
@@ -1733,7 +1733,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg;
- vi->ctrl->vid = vid;
+ vi->ctrl->vid = cpu_to_virtio16(vi->vdev, vid);
sg_init_one(&sg, &vi->ctrl->vid, sizeof(vi->ctrl->vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
--
MST
^ permalink raw reply related
* [PATCH v2 net 3/3] virtio_net: sparse annotation fix
From: Michael S. Tsirkin @ 2018-04-19 5:30 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Huth, Eric Dumazet, netdev, Cornelia Huck, virtualization,
Mikulas Patocka, David Miller
In-Reply-To: <1524115776-334953-1-git-send-email-mst@redhat.com>
offloads is a buffer in virtio format, should use
the __virtio64 tag.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/net/virtio_net.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f84fe04..c5b11f2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -155,7 +155,7 @@ struct control_buf {
u8 promisc;
u8 allmulti;
__virtio16 vid;
- u64 offloads;
+ __virtio64 offloads;
};
struct virtnet_info {
--
MST
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox