Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] net: dsa: drop some VLAs in switch.c
From: Salvatore Mesoraca @ 2018-05-07 19:02 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, linux-kernel, Kernel Hardening, netdev,
	David S. Miller, Kees Cook, Vivien Didelot, David Laight
In-Reply-To: <d7fa9cf7-7c7e-b01c-8925-ce6dafc8721c@gmail.com>

2018-05-07 20:14 GMT+02:00 Florian Fainelli <f.fainelli@gmail.com>:
> On 05/07/2018 08:23 AM, Salvatore Mesoraca wrote:
>> We avoid 2 VLAs by using a pre-allocated field in dsa_switch.
>> We also try to avoid dynamic allocation whenever possible.
>>
>> Link: http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
>> Link: http://lkml.kernel.org/r/20180505185145.GB32630@lunn.ch
>>
>> Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
>> ---
>>  include/net/dsa.h |  3 +++
>>  net/dsa/dsa2.c    | 14 ++++++++++++++
>>  net/dsa/switch.c  | 22 ++++++++++------------
>>  3 files changed, 27 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> index 60fb4ec..576791d 100644
>> --- a/include/net/dsa.h
>> +++ b/include/net/dsa.h
>> @@ -256,6 +256,9 @@ struct dsa_switch {
>>       /* Number of switch port queues */
>>       unsigned int            num_tx_queues;
>>
>> +     unsigned long           *bitmap;
>> +     unsigned long           _bitmap;
>> +
>>       /* Dynamically allocated ports, keep last */
>>       size_t num_ports;
>>       struct dsa_port ports[];
>> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
>> index adf50fb..cebf35f0 100644
>> --- a/net/dsa/dsa2.c
>> +++ b/net/dsa/dsa2.c
>> @@ -748,6 +748,20 @@ struct dsa_switch *dsa_switch_alloc(struct device *dev, size_t n)
>>       if (!ds)
>>               return NULL;
>>
>> +     /* We avoid allocating memory outside dsa_switch
>> +      * if it is not needed.
>> +      */
>> +     if (n <= sizeof(ds->_bitmap) * 8) {
>> +             ds->bitmap = &ds->_bitmap;
>
> Should not this be / BITS_PER_BYTE? If the sizeof(unsigned long) is <=
> 8, then you don't need to allocate it, otherwise, you have to.

No.
We need one 1 bit per port, of course sizeof() returns size in bytes,
hence the multiplication to get the number of bits.
I might multiply per BITS_PER_BYTE instead of 8, but I doubt that
Linux supports implementations where a byte is not an octet.

> I would actually just always dynamically allocate the bitmap, optimizing
> for the case where we have fewer than or 8 ports is not worth IMHO.

This optimization will save us an allocation when number of ports is
less than 32 or 64 (depending on arch).
IMHO it's useful, if you consider that, right now, DSA works only with
12-ports switches.

Thank you for your time,

Salvatore

^ permalink raw reply

* [PATCH net] r8169: fix powering up RTL8168h
From: Heiner Kallweit @ 2018-05-07 19:11 UTC (permalink / raw)
  To: David Miller, Realtek linux nic maintainers
  Cc: Slava Kardakov, netdev@vger.kernel.org

Since commit a92a08499b1f "r8169: improve runtime pm in general and
suspend unused ports" interfaces w/o link are runtime-suspended after
10s. On systems where drivers take longer to load this can lead to the
situation that the interface is runtime-suspended already when it's
initially brought up.
This shouldn't be a problem because rtl_open() resumes MAC/PHY.
However with at least one chip version the interface doesn't properly
come up, as reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=199549

The vendor driver uses a delay to give certain chip versions some
time to resume before starting the PHY configuration. So let's do
the same. I don't know which chip versions may be affected,
therefore apply this delay always.

This patch was reported to fix the issue for RTL8168h.
I was able to reproduce the issue on an Asus H310I-Plus which also
uses a RTL8168h. Also in my case the patch fixed the issue.

Reported-by: Slava Kardakov <ojab@ojab.ru>
Tested-by: Slava Kardakov <ojab@ojab.ru>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
This patch will not apply to net-next as it conflicts with other
changes which have been done in the meantime. So I'll send a
separate patch for net-next.
---
 drivers/net/ethernet/realtek/r8169.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 604ae783..c7aac1fc 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4981,6 +4981,9 @@ static void rtl_pll_power_down(struct rtl8169_private *tp)
 static void rtl_pll_power_up(struct rtl8169_private *tp)
 {
 	rtl_generic_op(tp, tp->pll_power_ops.up);
+
+	/* give MAC/PHY some time to resume */
+	msleep(20);
 }
 
 static void rtl_init_pll_power_ops(struct rtl8169_private *tp)
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next] r8169: fix powering up RTL8168h
From: Heiner Kallweit @ 2018-05-07 19:13 UTC (permalink / raw)
  To: David Miller, Realtek linux nic maintainers
  Cc: Slava Kardakov, netdev@vger.kernel.org

Since commit a92a08499b1f "r8169: improve runtime pm in general and
suspend unused ports" interfaces w/o link are runtime-suspended after
10s. On systems where drivers take longer to load this can lead to the
situation that the interface is runtime-suspended already when it's
initially brought up.
This shouldn't be a problem because rtl_open() resumes MAC/PHY.
However with at least one chip version the interface doesn't properly
come up, as reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=199549

The vendor driver uses a delay to give certain chip versions some
time to resume before starting the PHY configuration. So let's do
the same. I don't know which chip versions may be affected,
therefore apply this delay always.

This patch was reported to fix the issue for RTL8168h.
I was able to reproduce the issue on an Asus H310I-Plus which also
uses a RTL8168h. Also in my case the patch fixed the issue.

Reported-by: Slava Kardakov <ojab@ojab.ru>
Tested-by: Slava Kardakov <ojab@ojab.ru>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 6d99b141..a60207a5 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4662,6 +4662,9 @@ static void r8168_phy_power_up(struct rtl8169_private *tp)
 		break;
 	}
 	rtl_writephy(tp, MII_BMCR, BMCR_ANENABLE);
+
+	/* give PHY some time to resume */
+	msleep(20);
 }
 
 static void r8168_phy_power_down(struct rtl8169_private *tp)
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net] r8169: fix powering up RTL8168h
From: ojab // @ 2018-05-07 19:20 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <ff344a31-4cb8-1e49-7b5e-3a729125444b@gmail.com>

On Mon, May 7, 2018 at 7:11 PM, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> Since commit a92a08499b1f "r8169: improve runtime pm in general and
> suspend unused ports" interfaces w/o link are runtime-suspended after
> 10s. On systems where drivers take longer to load this can lead to the
> situation that the interface is runtime-suspended already when it's
> initially brought up.
> This shouldn't be a problem because rtl_open() resumes MAC/PHY.
> However with at least one chip version the interface doesn't properly
> come up, as reported here:
> https://bugzilla.kernel.org/show_bug.cgi?id=199549
>
> The vendor driver uses a delay to give certain chip versions some
> time to resume before starting the PHY configuration. So let's do
> the same. I don't know which chip versions may be affected,
> therefore apply this delay always.
>
> This patch was reported to fix the issue for RTL8168h.
> I was able to reproduce the issue on an Asus H310I-Plus which also
> uses a RTL8168h. Also in my case the patch fixed the issue.
>
> Reported-by: Slava Kardakov <ojab@ojab.ru>
> Tested-by: Slava Kardakov <ojab@ojab.ru>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Maybe also
Cc: stable@vger.kernel.org # v4.16+
?

> ---
> This patch will not apply to net-next as it conflicts with other
> changes which have been done in the meantime. So I'll send a
> separate patch for net-next.
> ---
>  drivers/net/ethernet/realtek/r8169.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 604ae783..c7aac1fc 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4981,6 +4981,9 @@ static void rtl_pll_power_down(struct rtl8169_private *tp)
>  static void rtl_pll_power_up(struct rtl8169_private *tp)
>  {
>         rtl_generic_op(tp, tp->pll_power_ops.up);
> +
> +       /* give MAC/PHY some time to resume */
> +       msleep(20);
>  }
>
>  static void rtl_init_pll_power_ops(struct rtl8169_private *tp)
> --
> 2.17.0
>

^ permalink raw reply

* Re: [PATCH v2] net: dsa: drop some VLAs in switch.c
From: Andrew Lunn @ 2018-05-07 19:26 UTC (permalink / raw)
  To: Salvatore Mesoraca
  Cc: Florian Fainelli, linux-kernel, Kernel Hardening, netdev,
	David S. Miller, Kees Cook, Vivien Didelot, David Laight
In-Reply-To: <CAJHCu1KsvPEs9vpp5bY04OeVfMtqZzPuO=9c8e2QP-+n+VKUjQ@mail.gmail.com>

> >> +++ b/include/net/dsa.h
> >> @@ -256,6 +256,9 @@ struct dsa_switch {
> >>       /* Number of switch port queues */
> >>       unsigned int            num_tx_queues;
> >>
> >> +     unsigned long           *bitmap;
> >> +     unsigned long           _bitmap;
> >> +
> >>       /* Dynamically allocated ports, keep last */
> >>       size_t num_ports;
> >>       struct dsa_port ports[];
> >> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> >> index adf50fb..cebf35f0 100644
> >> --- a/net/dsa/dsa2.c
> >> +++ b/net/dsa/dsa2.c
> >> @@ -748,6 +748,20 @@ struct dsa_switch *dsa_switch_alloc(struct device *dev, size_t n)
> >>       if (!ds)
> >>               return NULL;
> >>
> >> +     /* We avoid allocating memory outside dsa_switch
> >> +      * if it is not needed.
> >> +      */
> >> +     if (n <= sizeof(ds->_bitmap) * 8) {
> >> +             ds->bitmap = &ds->_bitmap;
> >
> > Should not this be / BITS_PER_BYTE? If the sizeof(unsigned long) is <=
> > 8, then you don't need to allocate it, otherwise, you have to.

> This optimization will save us an allocation when number of ports is
> less than 32 or 64 (depending on arch).
> IMHO it's useful, if you consider that, right now, DSA works only with
> 12-ports switches.

Do you have a feeling for the savings? I don't see it being very
large, and given the extra code, it might actually be negative.

    Andrew

^ permalink raw reply

* Re: [PATCH net] r8169: fix powering up RTL8168h
From: Andrew Lunn @ 2018-05-07 19:30 UTC (permalink / raw)
  To: ojab //
  Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <CAKzrAgT8ZEx3rr8G1+GpnMxd1y0D8c=at0n12pRTXfSpzxc1ww@mail.gmail.com>

On Mon, May 07, 2018 at 07:20:53PM +0000, ojab // wrote:
> On Mon, May 7, 2018 at 7:11 PM, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > Since commit a92a08499b1f "r8169: improve runtime pm in general and
> > suspend unused ports" interfaces w/o link are runtime-suspended after
> > 10s. On systems where drivers take longer to load this can lead to the
> > situation that the interface is runtime-suspended already when it's
> > initially brought up.
> > This shouldn't be a problem because rtl_open() resumes MAC/PHY.
> > However with at least one chip version the interface doesn't properly
> > come up, as reported here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=199549
> >
> > The vendor driver uses a delay to give certain chip versions some
> > time to resume before starting the PHY configuration. So let's do
> > the same. I don't know which chip versions may be affected,
> > therefore apply this delay always.
> >
> > This patch was reported to fix the issue for RTL8168h.
> > I was able to reproduce the issue on an Asus H310I-Plus which also
> > uses a RTL8168h. Also in my case the patch fixed the issue.
> >
> > Reported-by: Slava Kardakov <ojab@ojab.ru>
> > Tested-by: Slava Kardakov <ojab@ojab.ru>
> > Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> 
> Maybe also
> Cc: stable@vger.kernel.org # v4.16+

No need. Heiner correctly marked this for net, not net-next, so David
with do what is needed for it to goto stable.

     Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-05-07 19:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Alexei Starovoitov, Daniel Borkmann,
	Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180507150940.2578d6e3@redhat.com>

2018-05-07 15:09 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Mon, 7 May 2018 11:13:58 +0200
> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
>> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
>> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
>> >> <alexei.starovoitov@gmail.com> wrote:
>> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> >> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >> >
>> >> >> > This patch set introduces a new address family called AF_XDP that is
>> >> >> > optimized for high performance packet processing and, in upcoming
>> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed
>> >> >> > all zero-copy related code in order to make it smaller, simpler and
>> >> >> > hopefully more review friendly. This patch set only supports copy-mode
>> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> >> >> > work has already been accepted. We will publish our zero-copy support
>> >> >> > for RX and TX on top of his patch sets at a later point in time.
>> >> >>
>> >> >> +1, would be great to see it land this cycle. Saw few minor nits here
>> >> >> and there but nothing to hold it up, for the series:
>> >> >>
>> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> >> >>
>> >> >> Thanks everyone!
>> >> >
>> >> > Great stuff!
>> >> >
>> >> > Applied to bpf-next, with one condition.
>> >> > Upcoming zero-copy patches for both RX and TX need to be posted
>> >> > and reviewed within this release window.
>> >> > If netdev community as a whole won't be able to agree on the zero-copy
>> >> > bits we'd need to revert this feature before the next merge window.
>> >>
>> >> Thanks everyone for reviewing this. Highly appreciated.
>> >>
>> >> Just so we understand the purpose correctly:
>> >>
>> >> 1: Do you want to see the ZC patches in order to verify that the user
>> >> space API holds? If so, we can produce an additional RFC  patch set
>> >> using a big chunk of code that we had in RFC V1. We are not proud of
>> >> this code since it is clunky, but it hopefully proves the point with
>> >> the uapi being the same.
>> >>
>> >> 2: And/Or are you worried about us all (the netdev community) not
>> >> agreeing on a way to implement ZC internally in the drivers and the
>> >> XDP infrastructure? This is not going to be possible to finish during
>> >> this cycle since we do not like the implementation we had in RFC V1.
>> >> Too intrusive and now we also have nicer abstractions from Jesper that
>> >> we can use and extend to provide a (hopefully) much cleaner and less
>> >> intrusive solution.
>> >
>> > short answer: both.
>> >
>> > Cleanliness and performance of the ZC code is not as important as
>> > getting API right. The main concern that during ZC review process
>> > we will find out that existing API has issues, so we have to
>> > do this exercise before the merge window.
>> > And RFC won't fly. Send the patches for real. They have to go
>> > through the proper code review. The hackers of netdev community
>> > can accept a partial, or a bit unclean, or slightly inefficient
>> > implementation, since it can be and will be improved later,
>> > but API we cannot change once it goes into official release.
>> >
>> > Here is the example of API concern:
>> > this patch set added shared umem concept. It sounds good in theory,
>> > but will it perform well with ZC ? Earlier RFCs didn't have that
>> > feature. If it won't perform well than it shouldn't be in the tree.
>> > The key reason to let AF_XDP into the tree is its performance promise.
>> > If it doesn't perform we should rip it out and redesign.
>>
>> That is a fair point. We will try to produce patch sets for zero-copy
>> RX and TX using the latest interfaces within this merge window. Just
>> note that we will focus on this for the next week(s) instead of the
>> review items that you and Daniel Borkmann submitted. If we get those
>> patch sets out in time and we agree that they are a possible way
>> forward, then we produce patches with your fixes. It was mainly small
>> items, so should be quick.
>
> I would like to see that you create a new xdp_mem_type for this new
> zero-copy type. This will allow other XDP redirect methods/types (e.g.
> devmap and cpumap) to react appropriately when receiving a zero-copy
> frame.
>

Yes, that's the plan!

> For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call
> without (first) copying (into a newly allocated page).  By arguing that
> if an xsk-userspace app modify a frame it's not allowed to, then it is
> simply a bug in the program. (Note, this would also allow using
> ndo_xdp_xmit call for TX from xsk-userspace).
>

Makes sense. I think the ZC rational for Rx can indeed be extended for
devmap redirects -- i.e. no frame cloning is required.

> For cpumap, it is hard to avoid a copy, but I'm hoping we could delay
> the copy (and alloc of mem dest area) until on the remote CPU.  This is
> already the principle of cpumap; of moving the allocation of the SKB to
> the remote CPU.
>

I think for most AF_XDP applications that would like to pass frames to
the kernel, the cpumap would be preferred instead of XDP_PASS (moving
the stack execution to another off-AF_XDP-thread).

> For ZC to interact with XDP redirect-core and return API, the zero-copy
> memory type/allocator, need to provide an area for the xdp_frame data
> to be stored in (as we cannot allow using top-of-frame like
> non-zero-copy variants), and extend xdp_frame with an ZC umem-id.
> I imagine we can avoid any dynamic allocations, as we upfront (at bind
> and XDP_UMEM_REG time) know the number of frames.  (e.g. pre-alloc in
> xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func).
>

Yeah, we can allocate a kernel-side-only xdp_frame for each umem frame.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [net-next PATCH v3 4/6] udp: Partially unroll handling of first segment and last segment
From: Willem de Bruijn @ 2018-05-07 19:54 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <CAF=yD-JD=WCULdj_u0xjj3S9BDcbHhSCkWXCLHMx04U+Jgz55A@mail.gmail.com>

On Mon, May 7, 2018 at 2:57 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, May 7, 2018 at 2:08 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> This patch allows us to take care of unrolling the first segment and the
>> last segment of the loop for processing the segmented skb. Part of the
>> motivation for this is that it makes it easier to process the fact that the
>> first fame and all of the frames in between should be mostly identical
>> in terms of header data, and the last frame has differences in the length
>> and partial checksum.
>>
>> In addition I am dropping the header length calculation since we don't
>> really need it for anything but the last frame and it can be easily
>> obtained by just pulling the data_len and offset of tail from the transport
>> header.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>
> I'm not a fan of the more complicated control flow, as I pointed out
> before. It only seems to save one assignment to uh from segs.
>
> Both follow-up patches are now more complex, because they need
> to add the same code in two locations.

With that said, if you feel strongly, I don't object.

The removal of hdrlen and simplification of arguments is definitely
an improvement.

^ permalink raw reply

* Re: [net-next PATCH v3 4/6] udp: Partially unroll handling of first segment and last segment
From: Alexander Duyck @ 2018-05-07 19:59 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <CAF=yD-KTkmx2jzGAeNVLGWmedCEBx8RQWjJBKZzERGCddEnKWw@mail.gmail.com>

On Mon, May 7, 2018 at 12:54 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, May 7, 2018 at 2:57 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Mon, May 7, 2018 at 2:08 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>>
>>> This patch allows us to take care of unrolling the first segment and the
>>> last segment of the loop for processing the segmented skb. Part of the
>>> motivation for this is that it makes it easier to process the fact that the
>>> first fame and all of the frames in between should be mostly identical
>>> in terms of header data, and the last frame has differences in the length
>>> and partial checksum.
>>>
>>> In addition I am dropping the header length calculation since we don't
>>> really need it for anything but the last frame and it can be easily
>>> obtained by just pulling the data_len and offset of tail from the transport
>>> header.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> I'm not a fan of the more complicated control flow, as I pointed out
>> before. It only seems to save one assignment to uh from segs.
>>
>> Both follow-up patches are now more complex, because they need
>> to add the same code in two locations.
>
> With that said, if you feel strongly, I don't object.
>
> The removal of hdrlen and simplification of arguments is definitely
> an improvement.

Thanks for being understanding about this.

My preference is to keep the loop unrolled as it is since that way it
is not too different from the way we handle this for TCP so it will
maintenance of the two easier. Otherwise I have to add a bunch of
conditional checks inside the loop.

The other advantage to unrolling it as I did is that I don't have to
deal with a ton of extra indentation for an if statement inside of a
while loop.

- Alex

^ permalink raw reply

* Re: [PATCH net-next v8 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: David Miller @ 2018-05-07 20:03 UTC (permalink / raw)
  To: rahul.lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, viro, ebiederm,
	stephen, akpm, torvalds, ganeshgr, nirranjan, indranil
In-Reply-To: <cover.1525253481.git.rahul.lakkireddy@chelsio.com>

From: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Date: Wed,  2 May 2018 15:17:16 +0530

> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device in the crash recovery kernel. In crash
> recovery kernel, the collected logs are added as elf notes to
> /proc/vmcore, which is copied by user space scripts for post-analysis.

Eric B., since you've been giving very useful and active feedback on
this series could you please give it a review?

Thank you.

^ permalink raw reply

* Re: [net-next PATCH v3 0/6] UDP GSO Segmentation clean-ups
From: Alexander Duyck @ 2018-05-07 20:03 UTC (permalink / raw)
  To: Netdev, Willem de Bruijn, David Miller

On Mon, May 7, 2018 at 11:08 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> This patch set addresses a number of issues I found while sorting out
> enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
> were a number of issues related to the checksum and such that seemed to
> cause either minor irregularities or kernel panics in the case of the
> offload request being allowed to traverse between name spaces.
>
> With this set applied I am was able to get UDP GSO traffic to pass over
> vxlan tunnels in both offloaded modes and non-offloaded modes for ixgbe and
> ixgbevf.
>
> I submitted the driver specific patches earlier as an RFC:
> https://patchwork.ozlabs.org/project/netdev/list/?series=42477&archive=both&state=*
>
> v2: Updated patches based on feedback from Eric Dumazet
>     Split first patch into several patches based on feedback from Eric
> v3: Drop patch that was calling pskb_may_pull as it was redundant.
>     Added code to use MANGLED_0 in case of UDP checksum
>     Drop patch adding NETIF_F_GSO_UDP_L4 to list of GSO software offloads
>     Added Acked-by for patches reviewed by Willem and not changed

Just noticed I forgot to update the subject before sending out the
cover page. I updated it for this reply. If needed I will submit a v4,
but for now I will leave this out here to finish up review.

Thanks.

- Alex

^ permalink raw reply

* Re: [net-next PATCH v3 4/6] udp: Partially unroll handling of first segment and last segment
From: Willem de Bruijn @ 2018-05-07 20:10 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <CAKgT0UdBgTnvGBR9Y4KSsKBfebkOiXKJROr8uAMBW2sjFoHadw@mail.gmail.com>

On Mon, May 7, 2018 at 3:59 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, May 7, 2018 at 12:54 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Mon, May 7, 2018 at 2:57 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Mon, May 7, 2018 at 2:08 PM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>
>>>> This patch allows us to take care of unrolling the first segment and the
>>>> last segment of the loop for processing the segmented skb. Part of the
>>>> motivation for this is that it makes it easier to process the fact that the
>>>> first fame and all of the frames in between should be mostly identical
>>>> in terms of header data, and the last frame has differences in the length
>>>> and partial checksum.
>>>>
>>>> In addition I am dropping the header length calculation since we don't
>>>> really need it for anything but the last frame and it can be easily
>>>> obtained by just pulling the data_len and offset of tail from the transport
>>>> header.
>>>>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Acked-by: Willem de Bruijn <willemb@google.com>

>>> I'm not a fan of the more complicated control flow, as I pointed out
>>> before. It only seems to save one assignment to uh from segs.
>>>
>>> Both follow-up patches are now more complex, because they need
>>> to add the same code in two locations.
>>
>> With that said, if you feel strongly, I don't object.
>>
>> The removal of hdrlen and simplification of arguments is definitely
>> an improvement.
>
> Thanks for being understanding about this.
>
> My preference is to keep the loop unrolled as it is since that way it
> is not too different from the way we handle this for TCP so it will
> maintenance of the two easier. Otherwise I have to add a bunch of
> conditional checks inside the loop.
>
> The other advantage to unrolling it as I did is that I don't have to
> deal with a ton of extra indentation for an if statement inside of a
> while loop.

Both good reasons. Thanks a lot for the overall cleanup.

^ permalink raw reply

* Re: [PATCH] openvswitch: fix internal_dev_xmit()'s return type
From: Gregory Rose @ 2018-05-07 20:42 UTC (permalink / raw)
  To: Luc Van Oostenryck, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	David S. Miller
In-Reply-To: <20180424131747.4711-1-luc.vanoostenryck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 4/24/2018 6:17 AM, Luc Van Oostenryck wrote:
> The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
>
> Fix this by returning 'netdev_tx_t' in this driver too.
>
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>   net/openvswitch/vport-internal_dev.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
> index bb95c43aa..3ea55618e 100644
> --- a/net/openvswitch/vport-internal_dev.c
> +++ b/net/openvswitch/vport-internal_dev.c
> @@ -43,7 +43,7 @@ static struct internal_dev *internal_dev_priv(struct net_device *netdev)
>   }
>   
>   /* Called with rcu_read_lock_bh. */
> -static int internal_dev_xmit(struct sk_buff *skb, struct net_device *netdev)
> +static netdev_tx_t internal_dev_xmit(struct sk_buff *skb, struct net_device *netdev)
>   {
>   	int len, err;
>   

LGTM

Reviewed-by: Greg Rose <gvrose8192-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

^ permalink raw reply

* Re: [PATCH] openvswitch: make vport_ops:send()'s return type consistent
From: Gregory Rose @ 2018-05-07 20:44 UTC (permalink / raw)
  To: Luc Van Oostenryck, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	David S. Miller
In-Reply-To: <20180424131953.6474-1-luc.vanoostenryck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 4/24/2018 6:19 AM, Luc Van Oostenryck wrote:
> The method struct vport_ops:send() is defined as returning an
> 'netdev_tx_t', which is defined as a typedef for a bitwise type
> and otherwise used for the start_xmit() methods.
> However, most openvswitch drivers use for this method dev_queue_xmit()
> which returns an 'int' and the return value of vport_ops:send() is
> in fact never used.
>
> Make things typewise consistent and use 'int' for vport_ops:send()
> as well for internal_dev_recv() (which is the only proper send method)
> as using 'netdev_tx_t' doesn't offer any advantages and in fact seems,
> if not wrong at least, inadequate.
>
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> ---
>   net/openvswitch/vport-internal_dev.c | 6 +++---
>   net/openvswitch/vport.h              | 2 +-
>   2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
> index 3ea55618e..2fd68c2fb 100644
> --- a/net/openvswitch/vport-internal_dev.c
> +++ b/net/openvswitch/vport-internal_dev.c
> @@ -231,7 +231,7 @@ static void internal_dev_destroy(struct vport *vport)
>   	rtnl_unlock();
>   }
>   
> -static netdev_tx_t internal_dev_recv(struct sk_buff *skb)
> +static int internal_dev_recv(struct sk_buff *skb)
>   {
>   	struct net_device *netdev = skb->dev;
>   	struct pcpu_sw_netstats *stats;
> @@ -239,7 +239,7 @@ static netdev_tx_t internal_dev_recv(struct sk_buff *skb)
>   	if (unlikely(!(netdev->flags & IFF_UP))) {
>   		kfree_skb(skb);
>   		netdev->stats.rx_dropped++;
> -		return NETDEV_TX_OK;
> +		return 0;
>   	}
>   
>   	skb_dst_drop(skb);
> @@ -257,7 +257,7 @@ static netdev_tx_t internal_dev_recv(struct sk_buff *skb)
>   	u64_stats_update_end(&stats->syncp);
>   
>   	netif_rx(skb);
> -	return NETDEV_TX_OK;
> +	return 0;
>   }
>   
>   static struct vport_ops ovs_internal_vport_ops = {
> diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
> index cda66c26a..8dcb48fe8 100644
> --- a/net/openvswitch/vport.h
> +++ b/net/openvswitch/vport.h
> @@ -141,7 +141,7 @@ struct vport_ops {
>   	int (*set_options)(struct vport *, struct nlattr *);
>   	int (*get_options)(const struct vport *, struct sk_buff *);
>   
> -	netdev_tx_t (*send) (struct sk_buff *skb);
> +	int (*send) (struct sk_buff *skb);
>   	struct module *owner;
>   	struct list_head list;
>   };

Yes, it does seem odd to use a tx return type for receive.  Nice fixup.

Reviewed-by: Greg Rose <gvrose8192@gmail.com>

_______________________________________________
dev mailing list
dev@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

^ permalink raw reply

* Re: The SO_BINDTODEVICE was set to the desired interface, but packets are received from all interfaces.
From: David Ahern @ 2018-05-07 21:20 UTC (permalink / raw)
  To: Ben Greear, Damir Mansurov, netdev
  Cc: Konstantin Ushakov, Alexandra N. Kossovsky, Andrey Dmitrov
In-Reply-To: <5361bef8-bdf9-af3d-12ae-a128b6502d2e@candelatech.com>

On 5/7/18 10:14 AM, Ben Greear wrote:
> On 05/07/2018 03:19 AM, Damir Mansurov wrote:
>>
>> Greetings,
>>
>> After successful call of the setsockopt(SO_BINDTODEVICE) function to
>> set data reception from only one interface, the data is still received
>> from all interfaces.
>> Function setsockopt() returns 0 but then recv() receives data from all
>> available network interfaces.
>>
>> The problem is reproducible on linux kernels 4.14 - 4.16, but it does
>> not on linux kernels 4.4, 4.13.
>>
>> I have written C-code to reproduce this issue (see attached files
>> b2d_send.c and b2d_recv.c). See below explanation of tested
>> configuration.
> 
> Hello,
> 
> I am not sure if this is your problem or not, but if you are using VRF,
> then you need
> to call SO_BINDTODEVICE before you do the 'normal' bind() call.
> 

This is a different problem -- socket lookup is matching when it should not.

^ permalink raw reply

* Re: RTL8723BE performance regression
From: João Paulo Rechi Vita @ 2018-05-07 21:49 UTC (permalink / raw)
  To: Pkshih
  Cc: Larry.Finger@lwfinger.net, linux-kernel@vger.kernel.org,
	jprvita@endlessm.com, Birming Chiu, drake@endlessm.com,
	Chaoming_Li, kvalo@codeaurora.org, 莊彥宣,
	derosier@gmail.com, Steven Ting, netdev@vger.kernel.org,
	linux@endlessm.com, Shaofu, linux-wireless@vger.kernel.org
In-Reply-To: <1525240713.3735.3.camel@realtek.com>

On Tue, May 1, 2018 at 10:58 PM, Pkshih <pkshih@realtek.com> wrote:
> On Wed, 2018-05-02 at 05:44 +0000, Pkshih wrote:
>>
>> > -----Original Message-----
>> > From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
>> > Sent: Wednesday, May 02, 2018 6:41 AM
>> > To: Larry Finger
>> > Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
>> > linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endlessm.c
>> om
>> > Subject: Re: RTL8723BE performance regression
>> >
>> > On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
>> > > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
>> > >>
>> > >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
>> > >> wrote:
>> > >>
>> > >> (...)
>> > >>
>> > >>> As the antenna selection code changes affected your first bisection, do
>> > >>> you
>> > >>> have one of those HP laptops with only one antenna and the incorrect
>> > >>> coding
>> > >>> in the FUSE?
>> > >>
>> > >>
>> > >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
>> > >> was needed to achieve a good performance in the past, before this
>> > >> regression. I've also opened the laptop chassis and confirmed the
>> > >> antenna cable is plugged to the connector labeled with "1" on the
>> > >> card.
>> > >>
>> > >>> If so, please make sure that you still have the same signal
>> > >>> strength for good and bad cases. I have tried to keep the driver and the
>> > >>> btcoex code in sync, but there may be some combinations of antenna
>> > >>> configuration and FUSE contents that cause the code to fail.
>> > >>>
>> > >>
>> > >> What is the recommended way to monitor the signal strength?
>> > >
>> > >
>> > > The btcoex code is developed for multiple platforms by a different group
>> > > than the Linux driver. I think they made a change that caused ant_sel to
>> > > switch from 1 to 2. At least numerous comments at
>> > > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
>> > >
>> > > Mhy recommended method is to verify the wifi device name with "iw dev". Then
>> > > using that device
>> > >
>> > > sudo iw dev <dev_name> scan | egrep "SSID|signal"
>> > >
>> >
>> > I have confirmed that the performance regression is indeed tied to
>> > signal strength: on the good cases signal was between -16 and -8 dBm,
>> > whereas in bad cases signal was always between -50 to - 40 dBm. I've
>> > also switched to testing bandwidth in controlled LAN environment using
>> > iperf3, as suggested by Steve deRosier, with the DUT being the only
>> > machine connected to the 2.4 GHz radio and the machine running the
>> > iperf3 server connected via ethernet.
>> >
>>
>> We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup
>> 8723be ant_sel definition"). You can use the above commit and do the same
>> experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
>> Since performance is tied to signal strength, you can only share signal strength.
>>
>
> Please pay attention to cold reboot once ant_sel is changed.
>

I've tested the commit mentioned above and it fixes the problem on top
of v4.16 (in addition to the latest wireless-drivers-next also been
fixed as it already contains such commit). On v4.15, we also need the
following commits before "af8a41cccf8f rtlwifi: cleanup 8723be ant_sel
definition" to have a good performance again:

  874e837d67d0 rtlwifi: fill FW version and subversion
  a44709bba70f rtlwifi: btcoex: Add power_on_setting routine
  40d9dd4f1c5d rtlwifi: btcoex: Remove global variables from btcoex

Surprisingly, it seems forcing ant_sel=1 is not needed anymore on
these machines, as the shown by the numbers bellow (ant_sel=0 means
that actually no parameter was passed to the module). I have powered
off the machine and done a cold boot for every test. It seems
something have changed in the antenna auto-selection code since v4.11,
the latest point where I could confirm we definitely need to force
ant_sel=1. I've been trying to understand what causes this difference,
but haven't made progress on that so far, so any suggestions are
appreciated (we are trying to decide if we can confidently drop the
downstream DMI quirks for these specific machines).

  w-d-n ant_sel=0: -14.00 dBm,  69.5 Mbps -> good
  w-d-n ant_sel=1: -10.00 dBm,  41.1 Mbps -> good
  w-d-n ant_sel=2: -44.00 dBm,   607 kbps -> bad

  v4.16 ant_sel=0: -12.00 dBm,  63.0 Mbps -> good
  v4.16 ant_sel=1: - 8.00 dBm,  69.0 Mbps -> good
  v4.16 ant_sel=2: -50.00 dBm,   224 kbps -> bad

  v4.15 ant_sel=0: - 8.00 dBm,  33.0 Mbps -> good
  v4.15 ant_sel=1: -10.00 dBm,  38.1 Mbps -> good
  v4.15 ant_sel=2: -48.00 dBm,   206 kbps -> bad

--
João Paulo Rechi Vita
http://about.me/jprvita

^ permalink raw reply

* [PATCH net-next v10 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-05-07 22:10 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1525731046-10989-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f34794a76c4d..213fddc70fd0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v10 2/4] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-05-07 22:10 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1525731046-10989-1-git-send-email-sridhar.samudrala@intel.com>

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. For paravirtual drivers like virtio_net that use 3 netdev model, the
   the failover module provides interfaces to create/destroy additional
   master netdev and all the slave events are managed internally.
          net_failover_create()
          net_failover_destroy()
   A failover netdev is created that acts a master device and controls 2
   slave devices. The original virtio_net netdev is registered as 'standby'
   netdev and a passthru/vf device with the same MAC gets registered as
   'primary' netdev. Both 'standby' and 'failover' netdevs are associated
   with the same 'pci' device.  The user accesses the network interface via
   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
   default for transmits when it is available with link up and running.
2. For existing netvsc driver that uses 2 netdev model, no master netdev
   is created. The paravirtual driver registers each instance of netvsc
   as a 'failover' netdev  along with a set of ops to manage the slave
   events. There is no 'standby' netdev in this model. A passthru/vf device
   with the same MAC gets registered as 'primary' netdev.
          net_failover_register()
          net_failover_unregister()

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 MAINTAINERS                |    7 +
 include/linux/netdevice.h  |   16 +
 include/net/net_failover.h |   52 +++
 net/Kconfig                |   10 +
 net/core/Makefile          |    1 +
 net/core/net_failover.c    | 1044 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1130 insertions(+)
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/net_failover.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ebe0b9ed7805..83cbd99d8efa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9638,6 +9638,13 @@ S:	Maintained
 F:	Documentation/hwmon/nct6775
 F:	drivers/hwmon/nct6775.c
 
+NET_FAILOVER MODULE
+M:	Sridhar Samudrala <sridhar.samudrala@intel.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	net/core/net_failover.c
+F:	include/net/net_failover.h
+
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif <faisal.latif@intel.com>
 L:	linux-rdma@vger.kernel.org
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 46dcb5f7522f..4fff9b5d079e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1421,6 +1421,8 @@ struct net_device_ops {
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_FAILOVER			= 1<<27,
+	IFF_FAILOVER_SLAVE		= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1478,6 +1482,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_FAILOVER			IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4320,6 +4326,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
new file mode 100644
index 000000000000..221c2aff7531
--- /dev/null
+++ b/include/net/net_failover.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct net_failover_ops {
+	int (*slave_register)(struct net_device *slave_dev,
+			      struct net_device *failover_dev);
+	int (*slave_unregister)(struct net_device *slave_dev,
+				struct net_device *failover_dev);
+	int (*slave_link_change)(struct net_device *slave_dev,
+				 struct net_device *failover_dev);
+};
+
+struct net_failover {
+	struct list_head list;
+	struct net_device __rcu *failover_dev;
+	struct net_failover_ops __rcu *ops;
+};
+
+/* failover state */
+struct net_failover_info {
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+struct net_failover *net_failover_create(struct net_device *standby_dev);
+void net_failover_destroy(struct net_failover *failover);
+struct net_failover *net_failover_register(struct net_device *dev,
+					   struct net_failover_ops *ops);
+void net_failover_unregister(struct net_failover *failover);
+int net_failover_slave_unregister(struct net_device *slave_dev);
+
+#endif /* _NET_FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index b62089fb1332..0540856676de 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -429,6 +429,16 @@ config MAY_USE_DEVLINK
 config PAGE_POOL
        bool
 
+config NET_FAILOVER
+	tristate "Failover interface"
+	default m
+	help
+	  This provides a generic interface for paravirtual drivers to listen
+	  for netdev register/unregister/link change events from pci ethernet
+	  devices with the same MAC and takeover their datapath. This also
+	  enables live migration of a VM with direct attached VF by failing
+	  over to the paravirtual datapath when the VF is unplugged.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 7080417f8bc8..283ed9b0e581 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_FAILOVER) += net_failover.o
diff --git a/net/core/net_failover.c b/net/core/net_failover.c
new file mode 100644
index 000000000000..8d60e74e3034
--- /dev/null
+++ b/net/core/net_failover.c
@@ -0,0 +1,1044 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver and failover
+ * netdev management routines are based on bond/team driver.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/net_failover.h>
+
+static LIST_HEAD(net_failover_list);
+static DEFINE_SPINLOCK(net_failover_lock);
+
+static bool net_failover_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int net_failover_open(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int err;
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		err = dev_open(primary_dev);
+		if (err)
+			goto err_primary_open;
+	}
+
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		err = dev_open(standby_dev);
+		if (err)
+			goto err_standby_open;
+	}
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	}
+
+	return 0;
+
+err_standby_open:
+	dev_close(primary_dev);
+err_primary_open:
+	netif_tx_disable(dev);
+	return err;
+}
+
+static int net_failover_close(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	netif_tx_disable(dev);
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	return 0;
+}
+
+static netdev_tx_t net_failover_drop_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via primary netdev followed by standby netdev */
+	xmit_dev = rcu_dereference_bh(nfo_info->primary_dev);
+	if (!xmit_dev || !net_failover_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(nfo_info->standby_dev);
+		if (!xmit_dev || !net_failover_xmit_ready(xmit_dev))
+			return net_failover_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 net_failover_select_queue(struct net_device *dev,
+				     struct sk_buff *skb, void *accel_priv,
+				     select_queue_fallback_t fallback)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev;
+	u16 txq;
+
+	rcu_read_lock();
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+		if (ops->ndo_select_queue)
+			txq = ops->ndo_select_queue(primary_dev, skb,
+						    accel_priv, fallback);
+		else
+			txq = fallback(primary_dev, skb);
+
+		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+		return txq;
+	}
+
+	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void net_failover_fold_stats(struct rtnl_link_stats64 *_res,
+				    const struct rtnl_link_stats64 *_new,
+				    const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void net_failover_get_stats(struct net_device *dev,
+				   struct rtnl_link_stats64 *stats)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_dev;
+
+	spin_lock(&nfo_info->stats_lock);
+	memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->primary_stats);
+		memcpy(&nfo_info->primary_stats, new, sizeof(*new));
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->standby_stats);
+		memcpy(&nfo_info->standby_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&nfo_info->failover_stats, stats, sizeof(*stats));
+	spin_unlock(&nfo_info->stats_lock);
+}
+
+static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = dev_set_mtu(primary_dev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = dev_set_mtu(standby_dev, new_mtu);
+		if (ret) {
+			if (primary_dev)
+				dev_set_mtu(primary_dev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+
+	return 0;
+}
+
+static void net_failover_set_rx_mode(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
+					u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = vlan_vid_add(primary_dev, proto, vid);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = vlan_vid_add(standby_dev, proto, vid);
+		if (ret)
+			if (primary_dev)
+				vlan_vid_del(primary_dev, proto, vid);
+	}
+
+	return ret;
+}
+
+static int net_failover_vlan_rx_kill_vid(struct net_device *dev, __be16 proto,
+					 u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	return 0;
+}
+
+static const struct net_device_ops failover_dev_ops = {
+	.ndo_open		= net_failover_open,
+	.ndo_stop		= net_failover_close,
+	.ndo_start_xmit		= net_failover_start_xmit,
+	.ndo_select_queue	= net_failover_select_queue,
+	.ndo_get_stats64	= net_failover_get_stats,
+	.ndo_change_mtu		= net_failover_change_mtu,
+	.ndo_set_rx_mode	= net_failover_set_rx_mode,
+	.ndo_vlan_rx_add_vid	= net_failover_vlan_rx_add_vid,
+	.ndo_vlan_rx_kill_vid	= net_failover_vlan_rx_kill_vid,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define FAILOVER_NAME "net_failover"
+#define FAILOVER_VERSION "0.1"
+
+static void nfo_ethtool_get_drvinfo(struct net_device *dev,
+				    struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+static int nfo_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+		slave_dev = rtnl_dereference(nfo_info->standby_dev);
+		if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+
+static const struct ethtool_ops failover_ethtool_ops = {
+	.get_drvinfo            = nfo_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = nfo_ethtool_get_link_ksettings,
+};
+
+static struct net_device *net_failover_get_bymac(u8 *mac,
+						 struct net_failover_ops **ops)
+{
+	struct net_device *failover_dev;
+	struct net_failover *failover;
+
+	spin_lock(&net_failover_lock);
+	list_for_each_entry(failover, &net_failover_list, list) {
+		failover_dev = rtnl_dereference(failover->failover_dev);
+		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+			*ops = rtnl_dereference(failover->ops);
+			spin_unlock(&net_failover_lock);
+			return failover_dev;
+		}
+	}
+	spin_unlock(&net_failover_lock);
+	return NULL;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to failover dev.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *dev = rcu_dereference(skb->dev->rx_handler_data);
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+
+	if (primary_dev && skb->dev == standby_dev)
+		return RX_HANDLER_EXACT;
+
+	skb->dev = dev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+				 NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+static void net_failover_compute_features(struct net_device *dev)
+{
+	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+	netdev_features_t enc_features  = FAILOVER_ENC_FEATURES;
+	unsigned short max_hard_header_len = ETH_HLEN;
+	unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
+					IFF_XMIT_DST_RELEASE_PERM;
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  primary_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  primary_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= primary_dev->priv_flags;
+		if (primary_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = primary_dev->hard_header_len;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  standby_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  standby_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= standby_dev->priv_flags;
+		if (standby_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = standby_dev->hard_header_len;
+	}
+
+	dev->vlan_features = vlan_features;
+	dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+	dev->hard_header_len = max_hard_header_len;
+
+	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	if (dst_release_flag == (IFF_XMIT_DST_RELEASE |
+				 IFF_XMIT_DST_RELEASE_PERM))
+		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
+
+	netdev_change_features(dev);
+}
+
+static void net_failover_lower_state_changed(struct net_device *slave_dev,
+					     struct net_device *primary_dev,
+					     struct net_device *standby_dev)
+{
+	struct netdev_lag_lower_state_info info;
+
+	if (netif_carrier_ok(slave_dev))
+		info.link_up = true;
+	else
+		info.link_up = false;
+
+	if (slave_dev == primary_dev) {
+		if (netif_running(primary_dev))
+			info.tx_enabled = true;
+		else
+			info.tx_enabled = false;
+	} else {
+		if ((primary_dev && netif_running(primary_dev)) ||
+		    (!netif_running(standby_dev)))
+			info.tx_enabled = false;
+		else
+			info.tx_enabled = true;
+	}
+
+	netdev_lower_state_changed(slave_dev, &info);
+}
+
+/**
+ * net_failover_slave_register - Register a slave netdev
+ *
+ * @slave_dev: slave netdev that is being registered
+ *
+ * Registers a slave device to a failover instance. For a 2 netdev model,
+ * this will be primary netdev. In case of a 3 netdev model, it can be a
+ * standby or a primary netdev.
+ *
+ */
+static int net_failover_slave_register(struct net_device *slave_dev)
+{
+	struct net_device *standby_dev, *primary_dev, *failover_dev;
+	struct netdev_lag_upper_info lag_upper_info;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+	bool slave_is_standby;
+	u32 orig_mtu;
+	int err;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (failover_dev->type != slave_dev->type)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_register)
+		return nfo_ops->slave_register(slave_dev, failover_dev);
+
+	nfo_info = netdev_priv(failover_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+	if (slave_is_standby ? standby_dev : primary_dev) {
+		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_dev->name,
+			   slave_is_standby ? "standby" : "primary");
+		goto done;
+	}
+
+	/* We want to allow only a direct attached VF device as a primary
+	 * netdev. As there is no easy way to check for a VF device, restrict
+	 * this to a pci device.
+	 */
+	if (!slave_is_standby && (!slave_dev->dev.parent ||
+				  !dev_is_pci(slave_dev->dev.parent)))
+		goto done;
+
+	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
+	    vlan_uses_dev(failover_dev)) {
+		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+			   failover_dev->name);
+		goto done;
+	}
+
+	/* Align MTU of slave with failover dev */
+	orig_mtu = slave_dev->mtu;
+	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	if (err) {
+		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+			   slave_dev->name, failover_dev->mtu);
+		goto done;
+	}
+
+	dev_hold(slave_dev);
+
+	if (netif_running(failover_dev)) {
+		err = dev_open(slave_dev);
+		if (err && (err != -EBUSY)) {
+			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+				   slave_dev->name, err);
+			goto err_dev_open;
+		}
+	}
+
+	netif_addr_lock_bh(failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	netif_addr_unlock_bh(failover_dev);
+
+	err = vlan_vids_add_by_dev(slave_dev, failover_dev);
+	if (err) {
+		netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
+			   slave_dev->name, err);
+		goto err_vlan_add;
+	}
+
+	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
+					 failover_dev);
+	if (err) {
+		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+			   err);
+		goto err_handler_register;
+	}
+
+	lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
+	err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
+					   &lag_upper_info, NULL);
+	if (err) {
+		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+			   failover_dev->name, err);
+		goto err_upper_link;
+	}
+
+	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+	if (slave_is_standby) {
+		rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
+		standby_dev = slave_dev;
+		dev_get_stats(standby_dev, &nfo_info->standby_stats);
+	} else {
+		rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
+		primary_dev = slave_dev;
+		dev_get_stats(primary_dev, &nfo_info->primary_stats);
+		failover_dev->min_mtu = slave_dev->min_mtu;
+		failover_dev->max_mtu = slave_dev->max_mtu;
+	}
+
+	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+	net_failover_compute_features(failover_dev);
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s registered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+	goto done;
+
+err_upper_link:
+	netdev_rx_handler_unregister(slave_dev);
+err_handler_register:
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+err_vlan_add:
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+err_dev_open:
+	dev_put(slave_dev);
+	dev_set_mtu(slave_dev, orig_mtu);
+done:
+	return NOTIFY_DONE;
+}
+
+/**
+ * net_failover_slave_unregister - Unregister a slave netdev
+ *
+ * @slave_dev: slave netdev that is being unregistered
+ *
+ * Unregisters a slave device from a failover instance. For a 2 netdev model,
+ * this will be primary netdev. In case of a 3 netdev model, it can be a
+ * standby or a primary netdev.
+ *
+ */
+int net_failover_slave_unregister(struct net_device *slave_dev)
+{
+	struct net_device *standby_dev, *primary_dev, *failover_dev;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+	bool slave_is_standby;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_unregister)
+		return nfo_ops->slave_unregister(slave_dev, failover_dev);
+
+	nfo_info = netdev_priv(failover_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+
+	netdev_rx_handler_unregister(slave_dev);
+	netdev_upper_dev_unlink(slave_dev, failover_dev);
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+	nfo_info = netdev_priv(failover_dev);
+	net_failover_get_stats(failover_dev, &nfo_info->failover_stats);
+
+	if (slave_is_standby) {
+		RCU_INIT_POINTER(nfo_info->standby_dev, NULL);
+	} else {
+		RCU_INIT_POINTER(nfo_info->primary_dev, NULL);
+		if (standby_dev) {
+			failover_dev->min_mtu = standby_dev->min_mtu;
+			failover_dev->max_mtu = standby_dev->max_mtu;
+		}
+	}
+
+	dev_put(slave_dev);
+
+	net_failover_compute_features(failover_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(net_failover_slave_unregister);
+
+static int net_failover_slave_link_change(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_link_change)
+		return nfo_ops->slave_link_change(slave_dev, failover_dev);
+
+	if (!netif_running(failover_dev))
+		goto done;
+
+	nfo_info = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(failover_dev);
+		netif_tx_wake_all_queues(failover_dev);
+	} else {
+		net_failover_get_stats(failover_dev, &nfo_info->failover_stats);
+		netif_carrier_off(failover_dev);
+		netif_tx_stop_all_queues(failover_dev);
+	}
+
+	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int net_failover_slave_change_name(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (!netif_running(failover_dev))
+		goto done;
+
+	nfo_info = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	/* We need to bring up the slave after the rename by udev in case
+	 * open failed with EBUSY when it was registered.
+	 */
+	dev_open(slave_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int
+net_failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip parent events */
+	if (netif_is_failover(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return net_failover_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return net_failover_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return net_failover_slave_link_change(event_dev);
+	case NETDEV_CHANGENAME:
+		return net_failover_slave_change_name(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block net_failover_notifier = {
+	.notifier_call = net_failover_event,
+};
+
+static void
+net_failover_existing_slave_register(struct net_device *failover_dev)
+{
+	struct net *net = dev_net(failover_dev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (netif_is_failover(dev))
+			continue;
+		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+			net_failover_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+/**
+ * net_failover_register - Register a failover instance
+ *
+ * @dev: failover or standby netdev
+ * @ops: failover ops
+ *
+ * Paravirtual drivers supporting 3-netdev model call this routine indirectly
+ * via net_failover_create(). It passes failover netdev and ops will be NULL
+ * as the slave events are handled internally.
+ * Paravirtual drivers supporting 2-netdev model call this routine by passing
+ * standby netdev and ops that are called to handle slave register/unregister/
+ * link change events.
+ *
+ * Return: pointer to failover instance
+ */
+struct net_failover *net_failover_register(struct net_device *dev,
+					   struct net_failover_ops *ops)
+{
+	struct net_failover *failover;
+
+	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+	if (!failover)
+		return ERR_PTR(-ENOMEM);
+
+	rcu_assign_pointer(failover->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_FAILOVER;
+	rcu_assign_pointer(failover->failover_dev, dev);
+
+	spin_lock(&net_failover_lock);
+	list_add_tail(&failover->list, &net_failover_list);
+	spin_unlock(&net_failover_lock);
+
+	netdev_info(dev, "failover master:%s registered\n", dev->name);
+
+	net_failover_existing_slave_register(dev);
+
+	return failover;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(net_failover_register);
+
+/**
+ * net_failover_unregister - Unregister a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters and frees a failover instance.
+ */
+void net_failover_unregister(struct net_failover *failover)
+{
+	struct net_device *failover_dev;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+
+	netdev_info(failover_dev, "failover master:%s unregistered\n",
+		    failover_dev->name);
+
+	failover_dev->priv_flags &= ~IFF_FAILOVER;
+	dev_put(failover_dev);
+
+	spin_lock(&net_failover_lock);
+	list_del(&failover->list);
+	spin_unlock(&net_failover_lock);
+
+	kfree(failover);
+}
+EXPORT_SYMBOL_GPL(net_failover_unregister);
+
+/**
+ * net_failover_create - Create and register a failover instance
+ *
+ * @dev: standby netdev
+ *
+ * Creates a failover netdev and registers a failover instance for a standby
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ * The failover netdev acts as a master device and controls 2 slave devices -
+ * the original standby netdev and a VF netdev with the same MAC gets
+ * registered as primary netdev.
+ *
+ * Return: pointer to failover instance
+ */
+struct net_failover *net_failover_create(struct net_device *standby_dev)
+{
+	struct device *dev = standby_dev->dev.parent;
+	struct net_device *failover_dev;
+	struct net_failover *failover;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that VF devices being enslaved won't have too many queues.
+	 */
+	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
+	if (!failover_dev) {
+		dev_err(dev, "Unable to allocate failover_netdev!\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	dev_net_set(failover_dev, dev_net(standby_dev));
+	SET_NETDEV_DEV(failover_dev, dev);
+
+	failover_dev->netdev_ops = &failover_dev_ops;
+	failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+	/* Initialize the device options */
+	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire failover netdev's netif_tx_lock when transmitting */
+	failover_dev->features |= NETIF_F_LLTX;
+
+	/* Don't allow failover devices to change network namespaces. */
+	failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+	failover_dev->hw_features = FAILOVER_VLAN_FEATURES |
+				    NETIF_F_HW_VLAN_CTAG_TX |
+				    NETIF_F_HW_VLAN_CTAG_RX |
+				    NETIF_F_HW_VLAN_CTAG_FILTER;
+
+	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	failover_dev->features |= failover_dev->hw_features;
+
+	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+	       failover_dev->addr_len);
+
+	failover_dev->min_mtu = standby_dev->min_mtu;
+	failover_dev->max_mtu = standby_dev->max_mtu;
+
+	err = register_netdev(failover_dev);
+	if (err) {
+		dev_err(dev, "Unable to register failover_dev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(failover_dev);
+
+	failover = net_failover_register(failover_dev, NULL);
+	if (IS_ERR(failover))
+		goto err_failover_register;
+
+	return failover;
+
+err_failover_register:
+	unregister_netdev(failover_dev);
+err_register_netdev:
+	free_netdev(failover_dev);
+
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(net_failover_create);
+
+/**
+ * net_failover_destroy - Destroy a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters any slave netdevs associated with the failover instance by
+ * calling net_failover_slave_unregister().
+ * unregisters the failover instance itself and finally frees the failover
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ *
+ */
+void net_failover_destroy(struct net_failover *failover)
+{
+	struct net_failover_info *nfo_info;
+	struct net_device *failover_dev;
+	struct net_device *slave_dev;
+
+	if (!failover)
+		return;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+	nfo_info = netdev_priv(failover_dev);
+
+	netif_device_detach(failover_dev);
+
+	rtnl_lock();
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		net_failover_slave_unregister(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		net_failover_slave_unregister(slave_dev);
+
+	net_failover_unregister(failover);
+
+	unregister_netdevice(failover_dev);
+
+	rtnl_unlock();
+
+	free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(net_failover_destroy);
+
+static __init int
+net_failover_init(void)
+{
+	register_netdevice_notifier(&net_failover_notifier);
+
+	return 0;
+}
+module_init(net_failover_init);
+
+static __exit
+void net_failover_exit(void)
+{
+	unregister_netdevice_notifier(&net_failover_notifier);
+}
+module_exit(net_failover_exit);
+
+MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v10 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-05-07 22:10 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1525731046-10989-1-git-send-email-sridhar.samudrala@intel.com>

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It uses the generic failover framework that provides 2 functions to create
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/Kconfig      |  1 +
 drivers/net/virtio_net.c | 38 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index a029b27fd002..30e3bb0532e4 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -332,6 +332,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select NET_FAILOVER
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 213fddc70fd0..e7428d20763c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/net_failover.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -210,6 +213,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* failover when STANDBY feature enabled */
+	struct net_failover *failover;
 };
 
 struct padded_vnet_hdr {
@@ -1523,6 +1529,9 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
 	struct sockaddr *addr;
 	struct scatterlist sg;
 
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
 	addr = kmemdup(p, sizeof(*addr), GFP_KERNEL);
 	if (!addr)
 		return -ENOMEM;
@@ -2306,6 +2315,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "sby");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2323,6 +2348,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2876,10 +2902,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+		vi->failover = net_failover_create(vi->dev);
+		if (IS_ERR(vi->failover))
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_failover;
 	}
 
 	virtio_device_ready(vdev);
@@ -2916,6 +2948,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_failover:
+	net_failover_destroy(vi->failover);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2950,6 +2984,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	net_failover_destroy(vi->failover);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v10 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-05-07 22:10 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1525731046-10989-1-git-send-email-sridhar.samudrala@intel.com>

Use the registration/notification framework supported by the generic
failover infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 134 +++++++---------------------------------
 3 files changed, 26 insertions(+), 111 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 0765d5f61714..1f8419fc7c7f 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,5 +2,6 @@ config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
 	select UCS2_STRING
+	select NET_FAILOVER
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 6ebe39a3dde6..2ec18344c0e8 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,6 +932,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct net_failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..f852b3a9403a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/net_failover.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1914,24 +1875,15 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *vf_netdev,
+			      struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
 	if (vf_netdev->addr_len != ETH_ALEN)
 		return NOTIFY_DONE;
 
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
@@ -1948,17 +1900,13 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
@@ -1971,15 +1919,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev,
+				struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
@@ -1993,6 +1937,12 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
+static struct net_failover_ops netvsc_failover_ops = {
+	.slave_register		= netvsc_register_vf,
+	.slave_unregister	= netvsc_unregister_vf,
+	.slave_link_change	= netvsc_vf_changed,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2032,15 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	net_device_ctx->failover = net_failover_register(net,
+							 &netvsc_failover_ops);
+	if (IS_ERR(net_device_ctx->failover))
+		goto err_failover;
+
 	return ret;
 
+err_failover:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2081,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		net_failover_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	net_failover_unregister(ndev_ctx->failover);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2116,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2137,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v10 0/4] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-05-07 22:10 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
used by hypervisor to indicate that virtio_net interface should act as
a standby for another device with the same MAC address.

Patch 2 introduces a failover module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver creates
an additional 'failover' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'standby' netdev and a passthru/vf device with the same MAC gets
registered as 'primary' netdev. Both 'standby' and 'failover' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
netdev as default for transmits when it is available with link up and
running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by failover module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'primary' netdev gets registered. 
 
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v10:
- Tested live migration with virtio-net/AVF(i40evf) configured in failover
  mode while running iperf in background. Tried static ip and dhcp
  configurations using 'network' scripts and Network Manager.
  - To avoid dhcp requests to be sent over the lower 'standby' and 'primary'
    netdevs, ifcfg-xxx scripts for the lower devices can be created.
- Build tested netvsc module.
Updates:
- fix net_failover_open() to update failover CARRIER correctly based on
  standby and primary states. 
- fix net_failover_handle_frame() to handle frames received on standby
  when primary is present.
- replace netdev_upper_dev_link with netdev_master_upper_dev_link and
  handle lower dev state changes.
- fix net_failver_create() and net_failover_register() interfaces to
  use ERR_PTR and avoid arg **
- disable setting mac address when virtio-net in STANDBY mode
- document exported symbols
- added entry to MAINTAINERS file

v9:
Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET 
are enabled. (stephen)

v8:
- Made the failover managment routines more robust by updating the feature 
  bits/other fields in the failover netdev when slave netdevs are 
  registered/unregistered. (mst)
- added support for handling vlans.
- Limited the changes in netvsc to only use the notifier/event/lookups
  from the failover module. The slave register/unregister/link-change 
  handlers are only updated to use the getbymac routine to get the 
  upper netdev. There is no change in their functionality. (stephen)
- renamed structs/function/file names to use net_failover prefix. (mst)

v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
  (jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
  so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
  registered as primary. (stephen)
-  minor bugfixes

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
  net: Introduce generic failover module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the failover
    framework

 MAINTAINERS                     |    7 +
 drivers/net/Kconfig             |    1 +
 drivers/net/hyperv/Kconfig      |    1 +
 drivers/net/hyperv/hyperv_net.h |    2 +
 drivers/net/hyperv/netvsc_drv.c |  134 +----
 drivers/net/virtio_net.c        |   40 +-
 include/linux/netdevice.h       |   16 +
 include/net/net_failover.h      |   52 ++
 include/uapi/linux/virtio_net.h |    3 +
 net/Kconfig                     |   10 +
 net/core/Makefile               |    1 +
 net/core/net_failover.c         | 1044 +++++++++++++++++++++++++++++++++++++++
 12 files changed, 1198 insertions(+), 113 deletions(-)
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/net_failover.c

-- 
2.14.3

^ permalink raw reply

* [PATCH v1 iproute2-next 0/3] RDMA tool driver-specific resource tracking
From: Steve Wise @ 2018-05-07 16:06 UTC (permalink / raw)
  To: dsahern, leon; +Cc: stephen, netdev, linux-rdma

Hello,

This series enhances the iproute2 rdma tool to include displaying
driver-specific resource attributes.  It is the user-space part of the
kernel driver resource tracking series that has been accepted for merging
into linux-4.18 [1]

If there are no additional review comments, it can now be merged, I think.

Changes since v0/rfc:
- changed "provider" to "driver" based on kernel side changes
- updated man pages
- removed "RFC" tag

Thanks,

Steve.

[1] https://www.spinics.net/lists/linux-rdma/msg64199.html

Steve Wise (3):
  rdma: update rdma_netlink.h to get driver attrs
  rdma: print driver resource attributes
  rdma: update man pages

 man/man8/rdma-resource.8              |  29 ++++-
 man/man8/rdma.8                       |   2 +-
 rdma/include/uapi/rdma/rdma_netlink.h |  37 ++++++-
 rdma/rdma.c                           |   7 +-
 rdma/rdma.h                           |  11 ++
 rdma/res.c                            |  30 ++----
 rdma/utils.c                          | 194 ++++++++++++++++++++++++++++++++++
 7 files changed, 284 insertions(+), 26 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v1 iproute2-next 1/3] rdma: update rdma_netlink.h to get driver attrs
From: Steve Wise @ 2018-05-07 15:53 UTC (permalink / raw)
  To: dsahern, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <cover.1525709213.git.swise@opengridcomputing.com>

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 rdma/include/uapi/rdma/rdma_netlink.h | 37 ++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 45474f1..40be0d8 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -249,10 +249,22 @@ enum rdma_nldev_command {
 	RDMA_NLDEV_NUM_OPS
 };
 
+enum {
+	RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
+};
+
+enum rdma_nldev_print_type {
+	RDMA_NLDEV_PRINT_TYPE_UNSPEC,
+	RDMA_NLDEV_PRINT_TYPE_HEX,
+};
+
 enum rdma_nldev_attr {
 	/* don't change the order or add anything between, this is ABI! */
 	RDMA_NLDEV_ATTR_UNSPEC,
 
+	/* Pad attribute for 64b alignment */
+	RDMA_NLDEV_ATTR_PAD = RDMA_NLDEV_ATTR_UNSPEC,
+
 	/* Identifier for ib_device */
 	RDMA_NLDEV_ATTR_DEV_INDEX,		/* u32 */
 
@@ -387,8 +399,31 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_RES_PD_ENTRY,		/* nested table */
 	RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY,	/* u32 */
 	RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY,	/* u32 */
+	/*
+	 * driver-specific attributes.
+	 */
+	RDMA_NLDEV_ATTR_DRIVER,			/* nested table */
+	RDMA_NLDEV_ATTR_DRIVER_ENTRY,		/* nested table */
+	RDMA_NLDEV_ATTR_DRIVER_STRING,		/* string */
+	/*
+	 * u8 values from enum rdma_nldev_print_type
+	 */
+	RDMA_NLDEV_ATTR_DRIVER_PRINT_TYPE,	/* u8 */
+	RDMA_NLDEV_ATTR_DRIVER_S32,		/* s32 */
+	RDMA_NLDEV_ATTR_DRIVER_U32,		/* u32 */
+	RDMA_NLDEV_ATTR_DRIVER_S64,		/* s64 */
+	RDMA_NLDEV_ATTR_DRIVER_U64,		/* u64 */
 
-	/* Netdev information for relevant protocols, like RoCE and iWARP */
+	/*
+	 * Provides logical name and index of netdevice which is
+	 * connected to physical port. This information is relevant
+	 * for RoCE and iWARP.
+	 *
+	 * The netdevices which are associated with containers are
+	 * supposed to be exported together with GID table once it
+	 * will be exposed through the netlink. Because the
+	 * associated netdevices are properties of GIDs.
+	 */
 	RDMA_NLDEV_ATTR_NDEV_INDEX,		/* u32 */
 	RDMA_NLDEV_ATTR_NDEV_NAME,		/* string */
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v1 iproute2-next 2/3] rdma: print driver resource attributes
From: Steve Wise @ 2018-05-07 15:53 UTC (permalink / raw)
  To: dsahern, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <cover.1525709213.git.swise@opengridcomputing.com>

This enhancement allows printing rdma device-specific state, if provided
by the kernel.  This is done in a generic manner, so rdma tool doesn't
need to know about the details of every type of rdma device.

Driver attributes for a rdma resource are in the form of <key,
[print_type], value> tuples, where the key is a string and the value can
be any supported driver attribute.  The print_type attribute, if present,
provides a print format to use vs the standard print format for the type.
For example, the default print type for a PROVIDER_S32 value is "%d ",
but "0x%x " if the print_type of PRINT_TYPE_HEX is included inthe tuple.

Driver resources are only printed when the -dd flag is present.
If -p is present, then the output is formatted to not exceed 80 columns,
otherwise it is printed as a single row to be grep/awk friendly.

Example output:

# rdma resource show qp lqpn 1028 -dd -p
link cxgb4_0/- lqpn 1028 rqpn 0 type RC state RTS rq-psn 0 sq-psn 0 path-mig-state MIGRATED pid 0 comm [nvme_rdma]
    sqid 1028 flushed 0 memsize 123968 cidx 85 pidx 85 wq_pidx 106 flush_cidx 85 in_use 0
    size 386 flags 0x0 rqid 1029 memsize 16768 cidx 43 pidx 41 wq_pidx 171 msn 44 rqt_hwaddr 0x2a8a5d00
    rqt_size 256 in_use 128 size 130 idx 43 wr_id 0xffff881057c03408 idx 40 wr_id 0xffff881057c033f0

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 rdma/rdma.c  |   7 ++-
 rdma/rdma.h  |  11 ++++
 rdma/res.c   |  30 +++------
 rdma/utils.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 221 insertions(+), 21 deletions(-)

diff --git a/rdma/rdma.c b/rdma/rdma.c
index b43e538..0155627 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -132,6 +132,7 @@ int main(int argc, char **argv)
 	const char *batch_file = NULL;
 	bool pretty_output = false;
 	bool show_details = false;
+	bool show_driver_details = false;
 	bool json_output = false;
 	bool force = false;
 	char *filename;
@@ -152,7 +153,10 @@ int main(int argc, char **argv)
 			pretty_output = true;
 			break;
 		case 'd':
-			show_details = true;
+			if (show_details)
+				show_driver_details = true;
+			else
+				show_details = true;
 			break;
 		case 'j':
 			json_output = true;
@@ -180,6 +184,7 @@ int main(int argc, char **argv)
 	argv += optind;
 
 	rd.show_details = show_details;
+	rd.show_driver_details = show_driver_details;
 	rd.json_output = json_output;
 	rd.pretty_output = pretty_output;
 
diff --git a/rdma/rdma.h b/rdma/rdma.h
index 1908fc4..fcaf9e6 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -55,6 +55,7 @@ struct rd {
 	char **argv;
 	char *filename;
 	bool show_details;
+	bool show_driver_details;
 	struct list_head dev_map_list;
 	uint32_t dev_idx;
 	uint32_t port_idx;
@@ -115,4 +116,14 @@ int rd_recv_msg(struct rd *rd, mnl_cb_t callback, void *data, uint32_t seq);
 void rd_prepare_msg(struct rd *rd, uint32_t cmd, uint32_t *seq, uint16_t flags);
 int rd_dev_init_cb(const struct nlmsghdr *nlh, void *data);
 int rd_attr_cb(const struct nlattr *attr, void *data);
+int rd_attr_check(const struct nlattr *attr, int *typep);
+
+/*
+ * Print helpers
+ */
+void print_driver_table(struct rd *rd, struct nlattr *tb);
+void newline(struct rd *rd);
+void newline_indent(struct rd *rd);
+#define MAX_LINE_LENGTH 80
+
 #endif /* _RDMA_TOOL_H_ */
diff --git a/rdma/res.c b/rdma/res.c
index 1a0aab6..074b992 100644
--- a/rdma/res.c
+++ b/rdma/res.c
@@ -439,10 +439,8 @@ static int res_qp_parse_cb(const struct nlmsghdr *nlh, void *data)
 		if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
 			free(comm);
 
-		if (rd->json_output)
-			jsonw_end_array(rd->jw);
-		else
-			pr_out("\n");
+		print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
+		newline(rd);
 	}
 	return MNL_CB_OK;
 }
@@ -678,10 +676,8 @@ static int res_cm_id_parse_cb(const struct nlmsghdr *nlh, void *data)
 		if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
 			free(comm);
 
-		if (rd->json_output)
-			jsonw_end_array(rd->jw);
-		else
-			pr_out("\n");
+		print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
+		newline(rd);
 	}
 	return MNL_CB_OK;
 }
@@ -804,10 +800,8 @@ static int res_cq_parse_cb(const struct nlmsghdr *nlh, void *data)
 		if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
 			free(comm);
 
-		if (rd->json_output)
-			jsonw_end_array(rd->jw);
-		else
-			pr_out("\n");
+		print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
+		newline(rd);
 	}
 	return MNL_CB_OK;
 }
@@ -919,10 +913,8 @@ static int res_mr_parse_cb(const struct nlmsghdr *nlh, void *data)
 		if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
 			free(comm);
 
-		if (rd->json_output)
-			jsonw_end_array(rd->jw);
-		else
-			pr_out("\n");
+		print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
+		newline(rd);
 	}
 	return MNL_CB_OK;
 }
@@ -1004,10 +996,8 @@ static int res_pd_parse_cb(const struct nlmsghdr *nlh, void *data)
 		if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
 			free(comm);
 
-		if (rd->json_output)
-			jsonw_end_array(rd->jw);
-		else
-			pr_out("\n");
+		print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
+		newline(rd);
 	}
 	return MNL_CB_OK;
 }
diff --git a/rdma/utils.c b/rdma/utils.c
index 49c967f..452fe92 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -11,6 +11,7 @@
 
 #include "rdma.h"
 #include <ctype.h>
+#include <inttypes.h>
 
 int rd_argc(struct rd *rd)
 {
@@ -393,8 +394,32 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_RES_MRLEN] = MNL_TYPE_U64,
 	[RDMA_NLDEV_ATTR_NDEV_INDEX]		= MNL_TYPE_U32,
 	[RDMA_NLDEV_ATTR_NDEV_NAME]		= MNL_TYPE_NUL_STRING,
+	[RDMA_NLDEV_ATTR_DRIVER] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_DRIVER_ENTRY] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_DRIVER_STRING] = MNL_TYPE_NUL_STRING,
+	[RDMA_NLDEV_ATTR_DRIVER_PRINT_TYPE] = MNL_TYPE_U8,
+	[RDMA_NLDEV_ATTR_DRIVER_S32] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_DRIVER_U32] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_DRIVER_S64] = MNL_TYPE_U64,
+	[RDMA_NLDEV_ATTR_DRIVER_U64] = MNL_TYPE_U64,
 };
 
+int rd_attr_check(const struct nlattr *attr, int *typep)
+{
+	int type;
+
+	if (mnl_attr_type_valid(attr, RDMA_NLDEV_ATTR_MAX) < 0)
+		return MNL_CB_ERROR;
+
+	type = mnl_attr_get_type(attr);
+
+	if (mnl_attr_validate(attr, nldev_policy[type]) < 0)
+		return MNL_CB_ERROR;
+
+	*typep = nldev_policy[type];
+	return MNL_CB_OK;
+}
+
 int rd_attr_cb(const struct nlattr *attr, void *data)
 {
 	const struct nlattr **tb = data;
@@ -660,3 +685,172 @@ struct dev_map *dev_map_lookup(struct rd *rd, bool allow_port_index)
 	free(dev_name);
 	return dev_map;
 }
+
+#define nla_type(attr) ((attr)->nla_type & NLA_TYPE_MASK)
+
+void newline(struct rd *rd)
+{
+	if (rd->json_output)
+		jsonw_end_array(rd->jw);
+	else
+		pr_out("\n");
+}
+
+void newline_indent(struct rd *rd)
+{
+	newline(rd);
+	if (!rd->json_output)
+		pr_out("    ");
+}
+
+static int print_driver_string(struct rd *rd, const char *key_str,
+				 const char *val_str)
+{
+	if (rd->json_output) {
+		jsonw_string_field(rd->jw, key_str, val_str);
+		return 0;
+	} else {
+		return pr_out("%s %s ", key_str, val_str);
+	}
+}
+
+static int print_driver_s32(struct rd *rd, const char *key_str, int32_t val,
+			      enum rdma_nldev_print_type print_type)
+{
+	if (rd->json_output) {
+		jsonw_int_field(rd->jw, key_str, val);
+		return 0;
+	}
+	switch (print_type) {
+	case RDMA_NLDEV_PRINT_TYPE_UNSPEC:
+		return pr_out("%s %d ", key_str, val);
+	case RDMA_NLDEV_PRINT_TYPE_HEX:
+		return pr_out("%s 0x%x ", key_str, val);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int print_driver_u32(struct rd *rd, const char *key_str, uint32_t val,
+			      enum rdma_nldev_print_type print_type)
+{
+	if (rd->json_output) {
+		jsonw_int_field(rd->jw, key_str, val);
+		return 0;
+	}
+	switch (print_type) {
+	case RDMA_NLDEV_PRINT_TYPE_UNSPEC:
+		return pr_out("%s %u ", key_str, val);
+	case RDMA_NLDEV_PRINT_TYPE_HEX:
+		return pr_out("%s 0x%x ", key_str, val);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int print_driver_s64(struct rd *rd, const char *key_str, int64_t val,
+			      enum rdma_nldev_print_type print_type)
+{
+	if (rd->json_output) {
+		jsonw_int_field(rd->jw, key_str, val);
+		return 0;
+	}
+	switch (print_type) {
+	case RDMA_NLDEV_PRINT_TYPE_UNSPEC:
+		return pr_out("%s %" PRId64 " ", key_str, val);
+	case RDMA_NLDEV_PRINT_TYPE_HEX:
+		return pr_out("%s 0x%" PRIx64 " ", key_str, val);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int print_driver_u64(struct rd *rd, const char *key_str, uint64_t val,
+			      enum rdma_nldev_print_type print_type)
+{
+	if (rd->json_output) {
+		jsonw_int_field(rd->jw, key_str, val);
+		return 0;
+	}
+	switch (print_type) {
+	case RDMA_NLDEV_PRINT_TYPE_UNSPEC:
+		return pr_out("%s %" PRIu64 " ", key_str, val);
+	case RDMA_NLDEV_PRINT_TYPE_HEX:
+		return pr_out("%s 0x%" PRIx64 " ", key_str, val);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int print_driver_entry(struct rd *rd, struct nlattr *key_attr,
+				struct nlattr *val_attr,
+				enum rdma_nldev_print_type print_type)
+{
+	const char *key_str = mnl_attr_get_str(key_attr);
+	int attr_type = nla_type(val_attr);
+
+	switch (attr_type) {
+	case RDMA_NLDEV_ATTR_DRIVER_STRING:
+		return print_driver_string(rd, key_str,
+				mnl_attr_get_str(val_attr));
+	case RDMA_NLDEV_ATTR_DRIVER_S32:
+		return print_driver_s32(rd, key_str,
+				mnl_attr_get_u32(val_attr), print_type);
+	case RDMA_NLDEV_ATTR_DRIVER_U32:
+		return print_driver_u32(rd, key_str,
+				mnl_attr_get_u32(val_attr), print_type);
+	case RDMA_NLDEV_ATTR_DRIVER_S64:
+		return print_driver_s64(rd, key_str,
+				mnl_attr_get_u64(val_attr), print_type);
+	case RDMA_NLDEV_ATTR_DRIVER_U64:
+		return print_driver_u64(rd, key_str,
+				mnl_attr_get_u64(val_attr), print_type);
+	}
+	return -EINVAL;
+}
+
+void print_driver_table(struct rd *rd, struct nlattr *tb)
+{
+	int print_type = RDMA_NLDEV_PRINT_TYPE_UNSPEC;
+	struct nlattr *tb_entry, *key = NULL, *val;
+	int type, cc = 0;
+
+	if (!rd->show_driver_details || !tb)
+		return;
+
+	if (rd->pretty_output)
+		newline_indent(rd);
+
+	/*
+	 * Driver attrs are tuples of {key, [print-type], value}.
+	 * The key must be a string.  If print-type is present, it 
+	 * defines an alternate printf format type vs the native format
+	 * for the attribute.  And the value can be any available
+	 * driver type.
+	 */
+	mnl_attr_for_each_nested(tb_entry, tb) {
+
+		if (cc > MAX_LINE_LENGTH) {
+			if (rd->pretty_output)
+				newline_indent(rd);
+			cc = 0;
+		}
+		if (rd_attr_check(tb_entry, &type) != MNL_CB_OK)
+			return;
+		if (!key) {
+			if (type != MNL_TYPE_NUL_STRING)
+				return;
+			key = tb_entry;
+		} else if (type == MNL_TYPE_U8) {
+			print_type = mnl_attr_get_u8(tb_entry);
+		} else {
+			val = tb_entry;
+			cc += print_driver_entry(rd, key, val, print_type);
+			if (cc < 0)
+				return;
+			print_type = RDMA_NLDEV_PRINT_TYPE_UNSPEC;
+			key = NULL;
+		}
+	}
+	return;
+}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v1 iproute2-next 3/3] rdma: update man pages
From: Steve Wise @ 2018-05-07 15:53 UTC (permalink / raw)
  To: dsahern, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <cover.1525709213.git.swise@opengridcomputing.com>

Update the man pages for the resource attributes as well
as the driver-specific attributes.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 man/man8/rdma-resource.8 | 29 ++++++++++++++++++++++++++---
 man/man8/rdma.8          |  2 +-
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/man/man8/rdma-resource.8 b/man/man8/rdma-resource.8
index ff5d25d..40b073d 100644
--- a/man/man8/rdma-resource.8
+++ b/man/man8/rdma-resource.8
@@ -7,13 +7,16 @@ rdma-resource \- rdma resource configuration
 .in +8
 .ti -8
 .B rdma
-.RI "[ " OPTIONS " ]"
-.B resource
-.RI  " { " COMMAND " | "
+.RI "[ " OPTIONS " ] " RESOURCE " { " COMMAND " | "
 .BR help " }"
 .sp
 
 .ti -8
+.IR RESOURCE " := { "
+.BR cm_id " | " cq " | " mr " | " pd " | " qp " }"
+.sp
+
+.ti -8
 .IR OPTIONS " := { "
 \fB\-j\fR[\fIson\fR] |
 \fB\-d\fR[\fIetails\fR] }
@@ -70,11 +73,31 @@ rdma res show qp link mlx5_4/- -d
 Detailed view.
 .RE
 .PP
+rdma res show qp link mlx5_4/- -dd
+.RS 4
+Detailed view including driver-specific details.
+.RE
+.PP
 rdma res show qp link mlx5_4/1 lqpn 0-6
 .RS 4
 Limit to specific Local QPNs.
 .RE
 .PP
+rdma resource show cm_id dst-port 7174
+.RS 4
+Show CM_IDs with destination ip port of 7174.
+.RE
+.PP
+rdma resource show cm_id src-addr 172.16.0.100
+.RS 4
+Show CM_IDs bound to local ip address 172.16.0.100
+.RE
+.PP
+rdma resource show cq pid 30489
+.RS 4
+Show CQs belonging to pid 30489
+.RE
+.PP
 
 .SH SEE ALSO
 .BR rdma (8),
diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 6f88f37..12aa149 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -49,7 +49,7 @@ If there were any errors during execution of the commands, the application retur
 
 .TP
 .BR "\-d" , " --details"
-Output detailed information.
+Output detailed information.  Adding a second \-d includes driver-specific details.
 
 .TP
 .BR "\-p" , " --pretty"
-- 
1.8.3.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox