Netdev List
 help / color / mirror / Atom feed
* [PATCH] hv_netvsc: select needed ucs2_string routine
From: Stephen Hemminger @ 2018-04-20 15:48 UTC (permalink / raw)
  To: davem; +Cc: netdev, Stephen Hemminger

The conversion of rndis friendly name to utf8 uses a standard
kernel routine which is optional in config. Therefore build
would fail for some configurations. Resolve by selecting needed
library.

Fixes: 0fe554a46a0f ("hv_netvsc: propogate Hyper-V friendly name into interface alias")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..0765d5f61714 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
+	select UCS2_STRING
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: David Miller @ 2018-04-20 15:47 UTC (permalink / raw)
  To: mst
  Cc: stephen, sridhar.samudrala, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <20180420183505-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 20 Apr 2018 18:43:54 +0300

> On Fri, Apr 20, 2018 at 08:28:02AM -0700, Stephen Hemminger wrote:
>> Plus, DPDK is now dependent on existing model.
> 
> DPDK does the kernel bypass thing, doesn't it? Why does the kernel care?

+1

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Samudrala, Sridhar @ 2018-04-20 15:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: mst, davem, netdev, virtualization, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, jasowang, loseweigh, jiri
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>



On 4/20/2018 8:28 AM, Stephen Hemminger wrote:
> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size, and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
>
> I will NAK patches to change to common code for netvsc especially the
> three device model.  MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model; or do backport.

failover_ops are specifically added to support both 2-netdev and 3-netdev models
This patch doesn't change netvsc model. It still keeps its 2-netdev model. From
netvsc, point of view it is just moving some code from netvsc to the failover module
and also i think the eventhandling and getbymac routines are more optimal.


> Plus, DPDK is now dependent on existing model.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: David Miller @ 2018-04-20 15:46 UTC (permalink / raw)
  To: stephen
  Cc: sridhar.samudrala, mst, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Fri, 20 Apr 2018 08:28:02 -0700

> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> 
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size, and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
> 
> I will NAK patches to change to common code for netvsc especially the
> three device model.  MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model; or do backport.
> 
> Plus, DPDK is now dependent on existing model.

Stephen, I understand your situation.

But the result we have now is undesirable and it happened because MS
worked with distro vendors on this rather than the upstream community
and entities with other device with similar needs.

Please next time do the latter rather than the former.

Thank you.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-20 15:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <20180420082802.6ca37e4c@xeon-e3>

On Fri, Apr 20, 2018 at 08:28:02AM -0700, Stephen Hemminger wrote:
> On Thu, 19 Apr 2018 18:42:04 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > failover infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> 
> Do what you want to other devices but leave netvsc alone.
> Adding these failover ops does not reduce the code size,

drivers/net/hyperv/Kconfig      |   1 +
drivers/net/hyperv/hyperv_net.h |   2 +
drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
3 files changed, 55 insertions(+), 156 deletions(-)

100 lines gone.


> and really is
> no benefit.  The netvsc device driver needs to be backported to several
> other distributions and doing this makes that harder.
> 
> I will NAK patches to change to common code for netvsc

Wow.

> especially the
> three device model.

AFAIK these patches do not change netvsc to a three device model.

> MS worked hard with distro vendors to support transparent
> mode, ans we really can't have a new model;

That's why Sridhar worked hard to preserve a 2 device model for netvsc.

> or do backport.
>
> Plus, DPDK is now dependent on existing model.

DPDK does the kernel bypass thing, doesn't it? Why does the kernel care?

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-20 15:39 UTC (permalink / raw)
  To: Jonathan Corbet, Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Neal Cardwell,
	Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20180420091951.713c0b95@lwn.net>



On 04/20/2018 08:19 AM, Jonathan Corbet wrote:
> On Thu, 19 Apr 2018 18:01:32 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> We can keep mmap() nice interface, granted we can add one hook like in following patch.
>>
>> David, do you think such patch would be acceptable by lkml and mm/fs maintainers ?
>>
>> Alternative would be implementing an ioctl() or getsockopt() operation,
>> but it seems less natural...
> 

Hi Jonathan

> So I have little standing here, but what the heck, not letting that bother
> me has earned me a living for the last 20 years or so...:)
> 
> I think you should consider switching over to an interface where you
> mmap() the region once, and use ioctl() to move the data into that region,
> for a couple of reasons beyond the locking issues you've already found:
> 
>  - The "mmap() consumes data" semantics are a bit ... strange, IMO.
>    That's not what mmap() normally does.  People expect ioctl() to do
>    magic things, instead.

Well, the thing is that most of our use cases wont reuse same mmap() area.

RPC layer will provide all RPC with their associated pages to RPC consumers.

RPC consumers will decide to keep these pages or consume them.

So having to mmap() + another syscall to consume XXX bytes from receive queue is not
going to save cpu cycles :/

Having the ability to call mmap() multiple times for the same TCP payload is not
going to be of any use in real applications. This is why I only support 'offset 0'
for the last mmap() parameter.

> 
>  - I would expect it to be a tiny bit faster, since you wouldn't be doing
>    the VMA setup and teardown each time.

Maybe for the degenerated case we can reuse the same region over and over.

^ permalink raw reply

* Re: [PATCH v4 0/3] lan78xx: Read configuration from Device Tree
From: David Miller @ 2018-04-20 15:39 UTC (permalink / raw)
  To: phil
  Cc: woojung.huh, UNGLinuxDriver, robh+dt, mark.rutland, andrew,
	f.fainelli, mchehab, gregkh, linus.walleij, akpm, rdunlap, netdev,
	devicetree, linux-kernel, linux-usb
In-Reply-To: <1524157180-27276-1-git-send-email-phil@raspberrypi.org>

From: Phil Elwell <phil@raspberrypi.org>
Date: Thu, 19 Apr 2018 17:59:37 +0100

> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.
> 
> This patch set adds support for reading the MAC address and LED modes from
> Device Tree.
> 
> v4:
> - Rename nodes in bindings doc.
> 
> v3:
> - Move LED setting into PHY driver.
> 
> v2:
> - Use eth_platform_get_mac_address.
> - Support up to 4 LEDs, and move LED mode constants into dt-bindings header.
> - Improve bindings document.
> - Remove EEE support.

Series applied, thanks Phil.

^ permalink raw reply

* Re: [PATCH net-next] sfc: set and clear interrupt affinity hints
From: David Miller @ 2018-04-20 15:37 UTC (permalink / raw)
  To: bkenward; +Cc: netdev, linux-net-drivers
In-Reply-To: <7ba18d83-95d6-b83c-d19c-a50bcec79da0@solarflare.com>

From: Bert Kenward <bkenward@solarflare.com>
Date: Thu, 19 Apr 2018 17:37:25 +0100

> Use cpumask_local_spread to provide interrupt affinity hints
> for each queue. This will spread interrupts across NUMA local
> CPUs first, extending to remote nodes if needed.
> 
> Signed-off-by: Bert Kenward <bkenward@solarflare.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net/ipv6: Fix ip6_convert_metrics() bug
From: David Miller @ 2018-04-20 15:36 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, dsa
In-Reply-To: <20180419161453.46977-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 19 Apr 2018 09:14:53 -0700

> If ip6_convert_metrics() fails to allocate memory, it should not
> overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
 ...
> Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20 15:34 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <d1bf0771-751b-85bc-4890-8232283131a2@intel.com>

On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
> > > +	finfo = netdev_priv(failover_dev);
> > > +
> > > +	primary_dev = rtnl_dereference(finfo->primary_dev);
> > > +	standby_dev = rtnl_dereference(finfo->standby_dev);
> > > +
> > > +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> > > +		goto done;
> > > +
> > > +	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
> > > +	    (standby_dev && failover_xmit_ready(standby_dev))) {
> > > +		netif_carrier_on(failover_dev);
> > > +		netif_tx_wake_all_queues(failover_dev);
> > > +	} else {
> > > +		netif_carrier_off(failover_dev);
> > > +		netif_tx_stop_all_queues(failover_dev);
> > And I think it's a good idea to get stats from device here too.
> 
> Not sure why we need to get stats from lower devs here?

link down is often indication of a hardware problem.
lower dev might stop responding down the road.

> > > +static const struct net_device_ops failover_dev_ops = {
> > > +	.ndo_open		= failover_open,
> > > +	.ndo_stop		= failover_close,
> > > +	.ndo_start_xmit		= failover_start_xmit,
> > > +	.ndo_select_queue	= failover_select_queue,
> > > +	.ndo_get_stats64	= failover_get_stats,
> > > +	.ndo_change_mtu		= failover_change_mtu,
> > > +	.ndo_set_rx_mode	= failover_set_rx_mode,
> > > +	.ndo_validate_addr	= eth_validate_addr,
> > > +	.ndo_features_check	= passthru_features_check,
> > xdp support?
> 
> I think it should be possible to add it be calling the lower dev ndo_xdp routines
> with proper checks. can we add this later?

I'd be concerned that if you don't xdp userspace will keep poking
at lower devs. Then it will stop working if you add this later.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] net/smc: handle sockopt TCP_NODELAY
From: David Miller @ 2018-04-20 15:31 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180419135655.3058-3-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:53 +0200

> @@ -1412,6 +1523,10 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
>  		sk_common_release(sk);
>  		goto out;
>  	}
> +	/* clc handshake should run with disabled Nagle algorithm */
> +	rc = kernel_setsockopt(smc->clcsock, SOL_TCP, TCP_NODELAY, (char *)&val,
> +			       sizeof(val));
> +	smc->deferred_nodelay_reset = 1; /* TCP_NODELAY is not the default */
>  	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
>  	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);

This is not what I asked for.

If the user asked for the socket option, you are unconditionally returning success
to the original user.

If it fails here during the smc create, it's too late to handle the error properly.

This is the problem you must solve before I can take these changes properly.

You aren't even really failing the smc_create() here, because if you were you
would kill and free up the clcsock and release 'sk'.

As it stands now you are returning an error, and not releasing resources, if
the kernel_setsockopt() fails.

But more fundamentally these semantics are terrible.  You must not ever create
a situation where you tell the user his setsockopt succeeded by in the end
not honoring that reqeust fully.  That is what your current changes allow to
happen.

^ permalink raw reply

* Re: [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Michael S. Tsirkin @ 2018-04-20 15:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
	virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0Ucpw9D1u8MYoEX5KOij6sxs-5O9gPFTmOVq=Dgt=HUJ=g@mail.gmail.com>

On Fri, Apr 20, 2018 at 07:56:14AM -0700, Alexander Duyck wrote:
> > I think for virtio it should include the feature bit, yes.
> > Adding feature bit is very easy - post a patch to the virtio TC mailing
> > list, wait about a week to give people time to respond (two weeks if it
> > is around holidays and such).
> 
> The problem is we are talking about hardware/FPGA, not software.
> Adding a feature bit means going back and updating RTL. The software
> side of things is easy, re-validating things after a hardware/FPGA
> change not so much.
> 
> If this is a hard requirement I may just drop the virtio patch, push
> what I have, and leave it to Mark/Dan to deal with the necessary RTL
> and code changes needed to support Virtio as I don't expect the
> turnaround to be as easy as just a patch.
> 
> Thanks.
> 
> - Alex

Let's focus on virtio in this thread.

Involving the virtio TC in host/guest interface changes is a
hard requirement. It's just too easy to create conflicts otherwise.

So you guys should have just sent the proposal to the TC when you
were doing your RTL and you would have been in the clear.

Generally adding a feature bit with any extension is a good idea:
this way you merely reserve a feature bit for your feature through
the TC and are more or less sure of forward and backward compatibility.
It's incredibly easy.

But maybe it's not needed here.  I am not making the decisions myself.
Not too late: post to the TC list and let's see what the response is.
Without a feature bit you are making a change affecting all future
implementations without exception so the bar is a bit higher: you need
to actually post a spec text proposal not just a patch showing how to
use the feature, and TC needs to vote on it. Voting takes a week,
review a week or two depending on change complexity.

Hope this helps,

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-20 15:28 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, davem, netdev, virtualization, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-5-git-send-email-sridhar.samudrala@intel.com>

On Thu, 19 Apr 2018 18:42:04 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

Do what you want to other devices but leave netvsc alone.
Adding these failover ops does not reduce the code size, and really is
no benefit.  The netvsc device driver needs to be backported to several
other distributions and doing this makes that harder.

I will NAK patches to change to common code for netvsc especially the
three device model.  MS worked hard with distro vendors to support transparent
mode, ans we really can't have a new model; or do backport.

Plus, DPDK is now dependent on existing model.

^ permalink raw reply

* Re: [PATCH net-next 0/2] qed* : Use trust mode to override forced MAC
From: David Miller @ 2018-04-20 15:26 UTC (permalink / raw)
  To: shahed.shaikh; +Cc: netdev, Ariel.Elior, Dept-EngEverestLinuxL2
In-Reply-To: <20180419125012.20503-1-shahed.shaikh@cavium.com>

From: Shahed Shaikh <shahed.shaikh@cavium.com>
Date: Thu, 19 Apr 2018 05:50:10 -0700

> This patchset adds a support to override forced MAC (MAC set by PF for a VF)
> when trust mode is enabled using
> #ip link set dev <pf> vf <vf id> trust on
> 
> First patch adds a real change to use .ndo_set_vf_trust to override forced MAC
> and allow user to change VFs from VF interface itself.
> 
> Second patch takes care of a corner case, where MAC change from VF won't
> take effect when VF interface is down, by introducing a new TLV
> (a way to send message from VF to PF) to give a hint to PF to update
> its bulletin board.
> 
> Please apply this series to net-next.

Series applied.

^ permalink raw reply

* Re: [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:24 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Alexei Starovoitov, Song Liu, netdev, LKML
In-Reply-To: <20180420.112106.514748261967848944.davem@davemloft.net>

On Fri, Apr 20, 2018 at 11:21 PM, David Miller <davem@davemloft.net> wrote:
>
> Why are you sending this same patch twice?
>
> Thank you.

Some mistake.
Sorry about that.

Pls. use the second patch.

Thanks
Yafang

^ permalink raw reply

* Re: [net-next 1/3] tipc: set default MTU for UDP media
From: kbuild test robot @ 2018-04-20 15:22 UTC (permalink / raw)
  To: GhantaKrishnamurthy MohanKrishna
  Cc: kbuild-all, tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem
In-Reply-To: <1524128780-2550-2-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

[-- Attachment #1: Type: text/plain, Size: 5389 bytes --]

Hi GhantaKrishnamurthy,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/GhantaKrishnamurthy-MohanKrishna/tipc-Confgiuration-of-MTU-for-media-UDP/20180420-224412
config: x86_64-randconfig-x018-201815 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: the linux-review/GhantaKrishnamurthy-MohanKrishna/tipc-Confgiuration-of-MTU-for-media-UDP/20180420-224412 HEAD 5757244a45c9114ee8a7ed60e9b074107605f6eb builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   net//tipc/udp_media.c: In function 'tipc_udp_enable':
>> net//tipc/udp_media.c:716:20: error: 'struct tipc_media' has no member named 'mtu'
      b->mtu = b->media->mtu;
                       ^~
   net//tipc/udp_media.c: At top level:
   net//tipc/udp_media.c:805:3: error: 'struct tipc_media' has no member named 'mtu'
     .mtu  = TIPC_DEF_LINK_UDP_MTU,
      ^~~

vim +716 net//tipc/udp_media.c

   632	
   633	/**
   634	 * tipc_udp_enable - callback to create a new udp bearer instance
   635	 * @net:	network namespace
   636	 * @b:		pointer to generic tipc_bearer
   637	 * @attrs:	netlink bearer configuration
   638	 *
   639	 * validate the bearer parameters and initialize the udp bearer
   640	 * rtnl_lock should be held
   641	 */
   642	static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
   643				   struct nlattr *attrs[])
   644	{
   645		int err = -EINVAL;
   646		struct udp_bearer *ub;
   647		struct udp_media_addr remote = {0};
   648		struct udp_media_addr local = {0};
   649		struct udp_port_cfg udp_conf = {0};
   650		struct udp_tunnel_sock_cfg tuncfg = {NULL};
   651		struct nlattr *opts[TIPC_NLA_UDP_MAX + 1];
   652		u8 node_id[NODE_ID_LEN] = {0,};
   653	
   654		ub = kzalloc(sizeof(*ub), GFP_ATOMIC);
   655		if (!ub)
   656			return -ENOMEM;
   657	
   658		INIT_LIST_HEAD(&ub->rcast.list);
   659	
   660		if (!attrs[TIPC_NLA_BEARER_UDP_OPTS])
   661			goto err;
   662	
   663		if (nla_parse_nested(opts, TIPC_NLA_UDP_MAX,
   664				     attrs[TIPC_NLA_BEARER_UDP_OPTS],
   665				     tipc_nl_udp_policy, NULL))
   666			goto err;
   667	
   668		if (!opts[TIPC_NLA_UDP_LOCAL] || !opts[TIPC_NLA_UDP_REMOTE]) {
   669			pr_err("Invalid UDP bearer configuration");
   670			err = -EINVAL;
   671			goto err;
   672		}
   673	
   674		err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_LOCAL], &local,
   675					  &ub->ifindex);
   676		if (err)
   677			goto err;
   678	
   679		err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_REMOTE], &remote, NULL);
   680		if (err)
   681			goto err;
   682	
   683		/* Autoconfigure own node identity if needed */
   684		if (!tipc_own_id(net)) {
   685			memcpy(node_id, local.ipv6.in6_u.u6_addr8, 16);
   686			tipc_net_init(net, node_id, 0);
   687		}
   688		if (!tipc_own_id(net)) {
   689			pr_warn("Failed to set node id, please configure manually\n");
   690			err = -EINVAL;
   691			goto err;
   692		}
   693	
   694		b->bcast_addr.media_id = TIPC_MEDIA_TYPE_UDP;
   695		b->bcast_addr.broadcast = TIPC_BROADCAST_SUPPORT;
   696		rcu_assign_pointer(b->media_ptr, ub);
   697		rcu_assign_pointer(ub->bearer, b);
   698		tipc_udp_media_addr_set(&b->addr, &local);
   699		if (local.proto == htons(ETH_P_IP)) {
   700			struct net_device *dev;
   701	
   702			dev = __ip_dev_find(net, local.ipv4.s_addr, false);
   703			if (!dev) {
   704				err = -ENODEV;
   705				goto err;
   706			}
   707			udp_conf.family = AF_INET;
   708			udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
   709			udp_conf.use_udp_checksums = false;
   710			ub->ifindex = dev->ifindex;
   711			if (tipc_mtu_bad(dev, sizeof(struct iphdr) +
   712					      sizeof(struct udphdr))) {
   713				err = -EINVAL;
   714				goto err;
   715			}
 > 716			b->mtu = b->media->mtu;
   717	#if IS_ENABLED(CONFIG_IPV6)
   718		} else if (local.proto == htons(ETH_P_IPV6)) {
   719			udp_conf.family = AF_INET6;
   720			udp_conf.use_udp6_tx_checksums = true;
   721			udp_conf.use_udp6_rx_checksums = true;
   722			udp_conf.local_ip6 = in6addr_any;
   723			b->mtu = 1280;
   724	#endif
   725		} else {
   726			err = -EAFNOSUPPORT;
   727			goto err;
   728		}
   729		udp_conf.local_udp_port = local.port;
   730		err = udp_sock_create(net, &udp_conf, &ub->ubsock);
   731		if (err)
   732			goto err;
   733		tuncfg.sk_user_data = ub;
   734		tuncfg.encap_type = 1;
   735		tuncfg.encap_rcv = tipc_udp_recv;
   736		tuncfg.encap_destroy = NULL;
   737		setup_udp_tunnel_sock(net, ub->ubsock, &tuncfg);
   738	
   739		/**
   740		 * The bcast media address port is used for all peers and the ip
   741		 * is used if it's a multicast address.
   742		 */
   743		memcpy(&b->bcast_addr.value, &remote, sizeof(remote));
   744		if (tipc_udp_is_mcast_addr(&remote))
   745			err = enable_mcast(ub, &remote);
   746		else
   747			err = tipc_udp_rcast_add(b, &remote);
   748		if (err)
   749			goto err;
   750	
   751		return 0;
   752	err:
   753		if (ub->ubsock)
   754			udp_tunnel_sock_release(ub->ubsock);
   755		kfree(ub);
   756		return err;
   757	}
   758	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27623 bytes --]

^ permalink raw reply

* Re: [PATCH 4/8] sh_eth: Change platform check to CONFIG_ARCH_RENESAS
From: Sergei Shtylyov @ 2018-04-20 15:22 UTC (permalink / raw)
  To: Geert Uytterhoeven, Simon Horman, Magnus Damm, Russell King,
	Catalin Marinas, Will Deacon, Dan Williams, Vinod Koul,
	Mauro Carvalho Chehab, David S . Miller, Greg Kroah-Hartman,
	Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
	Arnd Bergmann, Kuninori Morimoto, Laurent Pinchart
  Cc: devel, alsa-devel, netdev, linux-kernel, linux-renesas-soc,
	dmaengine, linux-arm-kernel, linux-media
In-Reply-To: <1524230914-10175-5-git-send-email-geert+renesas@glider.be>

On 04/20/2018 04:28 PM, Geert Uytterhoeven wrote:

> Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS")
> is CONFIG_ARCH_RENESAS a more appropriate platform check than the legacy
> CONFIG_ARCH_SHMOBILE, hence use the former.
> 
> Renesas SuperH SH-Mobile SoCs are still covered by the CONFIG_CPU_SH4
> check.
> 
> This will allow to drop ARCH_SHMOBILE on ARM and ARM64 in the near
> future.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
[...]

Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

MBR, Sergei

^ permalink raw reply

* Re: [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: David Miller @ 2018-04-20 15:21 UTC (permalink / raw)
  To: laoar.shao
  Cc: eric.dumazet, alexei.starovoitov, songliubraving, netdev,
	linux-kernel
In-Reply-To: <1524237506-11011-1-git-send-email-laoar.shao@gmail.com>


Why are you sending this same patch twice?

Thank you.

^ permalink raw reply

* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Samudrala, Sridhar @ 2018-04-20 15:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <20180420045657-mutt-send-email-mst@kernel.org>


On 4/19/2018 7:44 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each instance
>> of netvsc as a 'failover' instance  along with a set of ops to manage the
>> slave events.
>>       failover_register()
>>       failover_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the failover module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>        failover_create()
>>        failover_destroy()
>> These functions call failover_register()/failover_unregister() with the
>> master netdev created by the failover module.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> I like this patch. Yes something to improve (see below)
>
>> ---
>>   include/linux/netdevice.h |  16 +
>>   include/net/failover.h    |  96 ++++++
>>   net/Kconfig               |  18 +
>>   net/core/Makefile         |   1 +
>>   net/core/failover.c       | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>   5 files changed, 975 insertions(+)
>>   create mode 100644 include/net/failover.h
>>   create mode 100644 net/core/failover.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..ed535b6724e1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1401,6 +1401,8 @@ struct net_device_ops {
>>    *	entity (i.e. the master device for bridged veth)
>>    * @IFF_MACSEC: device is a MACsec device
>>    * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>> + * @IFF_FAILOVER: device is a failover master device
>> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>>    */
>>   enum netdev_priv_flags {
>>   	IFF_802_1Q_VLAN			= 1<<0,
>> @@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
>>   	IFF_PHONY_HEADROOM		= 1<<24,
>>   	IFF_MACSEC			= 1<<25,
>>   	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_FAILOVER			= 1<<27,
>> +	IFF_FAILOVER_SLAVE		= 1<<28,
>>   };
>>   
>>   #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
>>   #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>   #define IFF_MACSEC			IFF_MACSEC
>>   #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_FAILOVER			IFF_FAILOVER
>> +#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
>>   
>>   /**
>>    *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>   	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>   }
>>   
>> +static inline bool netif_is_failover(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_FAILOVER;
>> +}
>> +
>> +static inline bool netif_is_failover_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_FAILOVER_SLAVE;
>> +}
>> +
>>   /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>   static inline void netif_keep_dst(struct net_device *dev)
>>   {
>> diff --git a/include/net/failover.h b/include/net/failover.h
>> new file mode 100644
>> index 000000000000..0b8601043d90
>> --- /dev/null
>> +++ b/include/net/failover.h
>> @@ -0,0 +1,96 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_FAILOVER_H
>> +#define _NET_FAILOVER_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct failover_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_dev,
>> +				  struct net_device *failover_dev);
>> +	int (*slave_join)(struct net_device *slave_dev,
>> +			  struct net_device *failover_dev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_dev,
>> +				    struct net_device *failover_dev);
>> +	int (*slave_release)(struct net_device *slave_dev,
>> +			     struct net_device *failover_dev);
>> +	int (*slave_link_change)(struct net_device *slave_dev,
>> +				 struct net_device *failover_dev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct failover {
>> +	struct list_head list;
>> +	struct net_device __rcu *failover_dev;
>> +	struct failover_ops __rcu *ops;
>> +};
>> +
>> +/* failover state */
>> +struct failover_info {
>> +	/* primary netdev with same MAC */
>> +	struct net_device __rcu *primary_dev;
>> +
>> +	/* standby netdev */
>> +	struct net_device __rcu *standby_dev;
>> +
>> +	/* primary netdev stats */
>> +	struct rtnl_link_stats64 primary_stats;
>> +
>> +	/* standby netdev stats */
>> +	struct rtnl_link_stats64 standby_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 failover_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_FAILOVER)
>> +
>> +int failover_create(struct net_device *standby_dev,
>> +		    struct failover **pfailover);
>> +void failover_destroy(struct failover *failover);
>> +
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> +		      struct failover **pfailover);
>> +void failover_unregister(struct failover *failover);
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev);
>> +
>> +#else
>> +
>> +static inline
>> +int failover_create(struct net_device *standby_dev,
>> +		    struct failover **pfailover);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void failover_destroy(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> +		      struct pfailover **pfailover);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void failover_unregister(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_FAILOVER_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..388b99dfee10 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>>   	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>>   	  devlink is a loadable module and the driver using it is built-in.
>>   
>> +config NET_FAILOVER
>> +	tristate "Failover interface"
>> +	help
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_FAILOVER
>> +	tristate
>> +	default m if NET_FAILOVER=m
>> +	default y if NET_FAILOVER=y || NET_FAILOVER=n
>> +	help
>> +	  Drivers using the failover infrastructure should have a dependency
>> +	  on MAY_USE_FAILOVER to ensure they do not cause link errors when
>> +	  failover is a loadable module and the driver using it is built-in.
>> +
>>   endif   # if NET
>>   
>>   # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..cef17518bb7d 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>>   obj-$(CONFIG_HWBM) += hwbm.o
>>   obj-$(CONFIG_NET_DEVLINK) += devlink.o
>>   obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_FAILOVER) += failover.o
>> diff --git a/net/core/failover.c b/net/core/failover.c
>> new file mode 100644
>> index 000000000000..7bee762cb737
>> --- /dev/null
>> +++ b/net/core/failover.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>
> I think you should copy the header from bond_main.c upon which
> some of the code seems to be based.

Yes. some of the code is based on bonding/team and netvsc. i added a reference
to netvsc in the comment, will also include bonding/team driver.

>
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/failover.h>
>> +
>> +static LIST_HEAD(failover_list);
>> +static DEFINE_SPINLOCK(failover_lock);
>> +
>> +static int failover_slave_pre_register(struct net_device *slave_dev,
>> +				       struct net_device *failover_dev,
>> +				       struct failover_ops *failover_ops)
>> +{
>> +	struct failover_info *finfo;
>> +	bool standby;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_pre_register(slave_dev,
>> +							failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> +	if (standby ? rtnl_dereference(finfo->standby_dev) :
>> +			rtnl_dereference(finfo->primary_dev)) {
>> +		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_dev->name, standby ? "standby" : "primary");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as primary netdev */
> Why? Pls change this comment so it explains the motivation
> rather than just repeat what the code does.

OK.

>
>> +	if (!standby && (!slave_dev->dev.parent ||
>> +			 !dev_is_pci(slave_dev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int failover_slave_join(struct net_device *slave_dev,
>> +			       struct net_device *failover_dev,
>> +			       struct failover_ops *failover_ops)
>> +{
>> +	struct failover_info *finfo;
>> +	int err, orig_mtu;
>> +	bool standby;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_join(slave_dev, failover_dev);
>> +	}
>> +
>> +	if (netif_running(failover_dev)) {
>> +		err = dev_open(slave_dev);
>> +		if (err && (err != -EBUSY)) {
>> +			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
>> +				   slave_dev->name, err);
>> +			goto err_dev_open;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with failover dev */
>> +	orig_mtu = slave_dev->mtu;
> I suspect this was copied from bond. this variable is never
> used and I'm even surprised gcc did not warn about this.

Thanks for catching, I broke this when i moved the dev_open() and dev_set_mtu()
calls from register to join. I need to reset slave_dev mtu to orig_mtu on failure.


>
>
>> +	err = dev_set_mtu(slave_dev, failover_dev->mtu);
> How do we know slave supports this MTU? same applies to
> failover_change_mtu.

The err check below should catch it and we will reset the mtu back and
fail the join/register.

>
>
>
>
>> +	if (err) {
>> +		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_dev->name, failover_dev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> +
>> +	dev_hold(slave_dev);
>> +
>> +	if (standby) {
>> +		rcu_assign_pointer(finfo->standby_dev, slave_dev);
>> +		dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
>> +	} else {
>> +		rcu_assign_pointer(finfo->primary_dev, slave_dev);
>> +		dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
>> +		failover_dev->min_mtu = slave_dev->min_mtu;
>> +		failover_dev->max_mtu = slave_dev->max_mtu;
>> +	}
>> +
>> +	netdev_info(failover_dev, "failover slave:%s joined\n",
>> +		    slave_dev->name);
>> +
>> +	return 0;
>> +
>> +err_set_mtu:
>> +	dev_close(slave_dev);
>> +err_dev_open:
>> +	return err;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
>> +{
>> +	struct net_device *failover_dev;
>> +	struct failover *failover;
>> +
>> +	spin_lock(&failover_lock);
>> +	list_for_each_entry(failover, &failover_list, list) {
>> +		failover_dev = rtnl_dereference(failover->failover_dev);
>> +		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
>> +			*ops = rtnl_dereference(failover->ops);
>> +			spin_unlock(&failover_lock);
>> +			return failover_dev;
>> +		}
>> +	}
>> +	spin_unlock(&failover_lock);
>> +	return NULL;
>> +}
>> +
>> +static int failover_slave_register(struct net_device *slave_dev)
>> +{
>> +	struct failover_ops *failover_ops;
>> +	struct net_device *failover_dev;
>> +	int ret;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	ret = failover_slave_pre_register(slave_dev, failover_dev,
>> +					  failover_ops);
>> +	if (ret)
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_dev, failover_ops ?
>> +					 failover_ops->handle_frame :
>> +					 failover_handle_frame, failover_dev);
>> +	if (ret) {
>> +		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> +	if (ret) {
>> +		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
>> +			   failover_dev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
>> +
>> +	ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
>> +	if (ret)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
>> +
>> +	netdev_info(failover_dev, "failover slave:%s registered\n",
>> +		    slave_dev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
>> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_dev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int failover_slave_pre_unregister(struct net_device *slave_dev,
>> +					 struct net_device *failover_dev,
>> +					 struct failover_ops *failover_ops)
>> +{
>> +	struct net_device *standby_dev, *primary_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_pre_unregister(slave_dev,
>> +							  failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int failover_slave_release(struct net_device *slave_dev,
>> +				  struct net_device *failover_dev,
>> +				  struct failover_ops *failover_ops)
>> +{
>> +	struct net_device *standby_dev, *primary_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_release)
>> +			return -EINVAL;
>> +
>> +		return failover_ops->slave_release(slave_dev, failover_dev);
>> +	}
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev == standby_dev) {
>> +		RCU_INIT_POINTER(finfo->standby_dev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(finfo->primary_dev, NULL);
>> +		if (standby_dev) {
>> +			failover_dev->min_mtu = standby_dev->min_mtu;
>> +			failover_dev->max_mtu = standby_dev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_dev);
>> +
>> +	netdev_info(failover_dev, "failover slave:%s released\n",
>> +		    slave_dev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> +	struct failover_ops *failover_ops;
>> +	struct net_device *failover_dev;
>> +	int ret;
>> +
>> +	if (!netif_is_failover_slave(slave_dev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	ret = failover_slave_pre_unregister(slave_dev, failover_dev,
>> +					    failover_ops);
>> +	if (ret)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_dev);
>> +	netdev_upper_dev_unlink(slave_dev, failover_dev);
>> +	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +
>> +	failover_slave_release(slave_dev, failover_dev, failover_ops);
>
> Don't you need to get stats from it? This device is going away ...

Yes. we need to update the failover_stats before the slave goes away.

>
>> +
>> +	netdev_info(failover_dev, "failover slave:%s unregistered\n",
>> +		    slave_dev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_slave_unregister);
>> +
>> +static bool failover_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int failover_slave_link_change(struct net_device *slave_dev)
>> +{
>> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>> +	struct failover_ops *failover_ops;
>> +	struct failover_info *finfo;
>> +
>> +	if (!netif_is_failover_slave(slave_dev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> +	if (!failover_dev)
>> +		goto done;
>> +
>> +	if (failover_ops) {
>> +		if (!failover_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return failover_ops->slave_link_change(slave_dev, failover_dev);
>> +	}
>> +
>> +	if (!netif_running(failover_dev))
>> +		return 0;
>> +
>> +	finfo = netdev_priv(failover_dev);
>> +
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
>> +		goto done;
>> +
>> +	if ((primary_dev && failover_xmit_ready(primary_dev)) ||
>> +	    (standby_dev && failover_xmit_ready(standby_dev))) {
>> +		netif_carrier_on(failover_dev);
>> +		netif_tx_wake_all_queues(failover_dev);
>> +	} else {
>> +		netif_carrier_off(failover_dev);
>> +		netif_tx_stop_all_queues(failover_dev);
> And I think it's a good idea to get stats from device here too.

Not sure why we need to get stats from lower devs here?


>
>
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool failover_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_failover(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
> ... for now. It would be possible easy to make this generic -
> just copy things like type and addr_len from slave.

ok. failover_create can copy these values from standby_dev and
we can validate that they match when primary_dev registers.

>
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +failover_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!failover_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return failover_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return failover_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return failover_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block failover_notifier = {
>> +	.notifier_call = failover_event,
>> +};
>> +
>> +static int failover_open(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev, *standby_dev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	primary_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		err = dev_open(primary_dev);
>> +		if (err)
>> +			goto err_primary_open;
>> +	}
>> +
>> +	standby_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (standby_dev) {
>> +		err = dev_open(standby_dev);
>> +		if (err)
>> +			goto err_standby_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_standby_open:
>> +	dev_close(primary_dev);
>> +err_primary_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +
>> +static int failover_close(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (slave_dev)
>> +		dev_close(slave_dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (slave_dev)
>> +		dev_close(slave_dev);
>> +
>> +	return 0;
>> +}
>> +
>> +static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
>> +				      struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
>> +				       struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *xmit_dev;
>> +
>> +	/* Try xmit via primary netdev followed by standby netdev */
>> +	xmit_dev = rcu_dereference_bh(finfo->primary_dev);
>> +	if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(finfo->standby_dev);
>> +		if (!xmit_dev || !failover_xmit_ready(xmit_dev))
>> +			return failover_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
> Is this going through qdisc twice? Won't this hurt performance
> measureably?

The failover dev has no queue (IFF_NO_QUEUE) , so doesn't go through qdisc twice.

>
>> +
>> +static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +				 void *accel_priv,
>> +				 select_queue_fallback_t fallback)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev;
>> +	u16 txq;
>> +
>> +	rcu_read_lock();
>> +	primary_dev = rcu_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		const struct net_device_ops *ops = primary_dev->netdev_ops;
>> +
>> +		if (ops->ndo_select_queue)
>> +			txq = ops->ndo_select_queue(primary_dev, skb,
>> +						    accel_priv, fallback);
>> +		else
>> +			txq = fallback(primary_dev, skb);
>> +
>> +		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +		return txq;
>> +	}
>> +
>> +	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32finfot values only.
>> + */
>> +static void failover_fold_stats(struct rtnl_link_stats64 *_res,
>> +				const struct rtnl_link_stats64 *_new,
>> +				const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +static void failover_get_stats(struct net_device *dev,
>> +			       struct rtnl_link_stats64 *stats)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_dev;
>> +
>> +	spin_lock(&finfo->stats_lock);
>> +	memcpy(stats, &finfo->failover_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_dev = rcu_dereference(finfo->primary_dev);
>> +	if (slave_dev) {
>> +		new = dev_get_stats(slave_dev, &temp);
>> +		failover_fold_stats(stats, new, &finfo->primary_stats);
>> +		memcpy(&finfo->primary_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_dev = rcu_dereference(finfo->standby_dev);
>> +	if (slave_dev) {
>> +		new = dev_get_stats(slave_dev, &temp);
>> +		failover_fold_stats(stats, new, &finfo->standby_stats);
>> +		memcpy(&finfo->standby_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&finfo->failover_stats, stats, sizeof(*stats));
>> +	spin_unlock(&finfo->stats_lock);
>> +}
>> +
>> +static int failover_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *primary_dev, *standby_dev;
>> +	int ret = 0;
>> +
>> +	primary_dev = rcu_dereference(finfo->primary_dev);
>> +	if (primary_dev) {
>> +		ret = dev_set_mtu(primary_dev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	standby_dev = rcu_dereference(finfo->standby_dev);
>> +	if (standby_dev) {
>> +		ret = dev_set_mtu(standby_dev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(primary_dev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +
>> +static void failover_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_dev = rcu_dereference(finfo->primary_dev);
>> +	if (slave_dev) {
>> +		dev_uc_sync_multiple(slave_dev, dev);
>> +		dev_mc_sync_multiple(slave_dev, dev);
>> +	}
>> +
>> +	slave_dev = rcu_dereference(finfo->standby_dev);
>> +	if (slave_dev) {
>> +		dev_uc_sync_multiple(slave_dev, dev);
>> +		dev_mc_sync_multiple(slave_dev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +
>> +static const struct net_device_ops failover_dev_ops = {
>> +	.ndo_open		= failover_open,
>> +	.ndo_stop		= failover_close,
>> +	.ndo_start_xmit		= failover_start_xmit,
>> +	.ndo_select_queue	= failover_select_queue,
>> +	.ndo_get_stats64	= failover_get_stats,
>> +	.ndo_change_mtu		= failover_change_mtu,
>> +	.ndo_set_rx_mode	= failover_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
> xdp support?

I think it should be possible to add it be calling the lower dev ndo_xdp routines
with proper checks. can we add this later?

>
>> +};
>> +
>> +#define FAILOVER_NAME "failover"
>> +#define FAILOVER_VERSION "0.1"
>> +
>> +static void failover_ethtool_get_drvinfo(struct net_device *dev,
>> +					 struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int failover_ethtool_get_link_ksettings(struct net_device *dev,
>> +					struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct failover_info *finfo = netdev_priv(dev);
>> +	struct net_device *slave_dev;
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> +		slave_dev = rtnl_dereference(finfo->standby_dev);
>> +		if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_dev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops failover_ethtool_ops = {
>> +	.get_drvinfo            = failover_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = failover_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void failover_register_existing_slave(struct net_device *failover_dev)
>> +{
>> +	struct net *net = dev_net(failover_dev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == failover_dev)
>> +			continue;
>> +		if (!failover_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
>> +			failover_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int failover_register(struct net_device *dev, struct failover_ops *ops,
>> +		      struct failover **pfailover)
>> +{
>> +	struct failover *failover;
>> +
>> +	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
>> +	if (!failover)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(failover->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_FAILOVER;
>> +	rcu_assign_pointer(failover->failover_dev, dev);
>> +
>> +	spin_lock(&failover_lock);
>> +	list_add_tail(&failover->list, &failover_list);
>> +	spin_unlock(&failover_lock);
>> +
>> +	failover_register_existing_slave(dev);
>> +
>> +	*pfailover = failover;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_register);
>> +
>> +void failover_unregister(struct failover *failover)
>> +{
>> +	struct net_device *failover_dev;
>> +
>> +	failover_dev = rcu_dereference(failover->failover_dev);
>> +
>> +	failover_dev->priv_flags &= ~IFF_FAILOVER;
>> +	dev_put(failover_dev);
>> +
>> +	spin_lock(&failover_lock);
>> +	list_del(&failover->list);
>> +	spin_unlock(&failover_lock);
>> +
>> +	kfree(failover);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_unregister);
>> +
>> +int failover_create(struct net_device *standby_dev, struct failover **pfailover)
>> +{
>> +	struct device *dev = standby_dev->dev.parent;
>> +	struct net_device *failover_dev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
>> +	if (!failover_dev) {
>> +		dev_err(dev, "Unable to allocate failover_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(failover_dev, dev_net(standby_dev));
>> +	SET_NETDEV_DEV(failover_dev, dev);
>> +
>> +	failover_dev->netdev_ops = &failover_dev_ops;
>> +	failover_dev->ethtool_ops = &failover_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire failover netdev's netif_tx_lock when transmitting */
>> +	failover_dev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow failover devices to change network namespaces. */
>> +	failover_dev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				    NETIF_F_HIGHDMA | NETIF_F_LRO;
> OK but then you must make sure your primary and standby both
> support these features.

I guess i need to add something like bond_compute_features() to handle this.

>
>> +
>> +	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	failover_dev->features |= failover_dev->hw_features;
>> +
>> +	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
>> +	       failover_dev->addr_len);
>> +
>> +	failover_dev->min_mtu = standby_dev->min_mtu;
>> +	failover_dev->max_mtu = standby_dev->max_mtu;
> OK MTU is copied, fine. But is this always enough?
>
> How about e.g. hard_header_len? min_header_len? needed_headroom?
> needed_tailroom? I'd worry that even if you cover existing ones more
> might be added with time.  A function copying config between devices
> probably belongs in some central place IMHO.

ok. will add a function that handles these fields too.

>
>
>> +
>> +	err = register_netdev(failover_dev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register failover_dev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(failover_dev);
>> +
>> +	err = failover_register(failover_dev, NULL, pfailover);
>> +	if (err < 0)
>> +		goto err_failover;
>> +
>> +	return 0;
>> +
>> +err_failover:
>> +	unregister_netdev(failover_dev);
>> +err_register_netdev:
>> +	free_netdev(failover_dev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_create);
>> +
>> +void failover_destroy(struct failover *failover)
>> +{
>> +	struct net_device *failover_dev;
>> +	struct net_device *slave_dev;
>> +	struct failover_info *finfo;
>> +
>> +	if (!failover)
>> +		return;
>> +
>> +	failover_dev = rcu_dereference(failover->failover_dev);
>> +	finfo = netdev_priv(failover_dev);
>> +
>> +	netif_device_detach(failover_dev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_dev = rtnl_dereference(finfo->primary_dev);
>> +	if (slave_dev)
>> +		failover_slave_unregister(slave_dev);
>> +
>> +	slave_dev = rtnl_dereference(finfo->standby_dev);
>> +	if (slave_dev)
>> +		failover_slave_unregister(slave_dev);
>> +
>> +	failover_unregister(failover);
>> +
>> +	unregister_netdevice(failover_dev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(failover_dev);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_destroy);
>> +
>> +static __init int
>> +failover_init(void)
>> +{
>> +	register_netdevice_notifier(&failover_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(failover_init);
>> +
>> +static __exit
>> +void failover_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&failover_notifier);
>> +}
>> +module_exit(failover_exit);
>> +
>> +MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3

^ permalink raw reply

* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Jonathan Corbet @ 2018-04-20 15:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Neal Cardwell,
	Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <1e7a78a6-27cd-9679-54d7-44d05484eda7@gmail.com>

On Thu, 19 Apr 2018 18:01:32 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We can keep mmap() nice interface, granted we can add one hook like in following patch.
> 
> David, do you think such patch would be acceptable by lkml and mm/fs maintainers ?
> 
> Alternative would be implementing an ioctl() or getsockopt() operation,
> but it seems less natural...

So I have little standing here, but what the heck, not letting that bother
me has earned me a living for the last 20 years or so...:)

I think you should consider switching over to an interface where you
mmap() the region once, and use ioctl() to move the data into that region,
for a couple of reasons beyond the locking issues you've already found:

 - The "mmap() consumes data" semantics are a bit ... strange, IMO.
   That's not what mmap() normally does.  People expect ioctl() to do
   magic things, instead.

 - I would expect it to be a tiny bit faster, since you wouldn't be doing
   the VMA setup and teardown each time.

Thanks,

jon

^ permalink raw reply

* Re: [PATCH net-next 0/4] geneve: verify user specified MTU or adjust with a lower device
From: David Miller @ 2018-04-20 15:19 UTC (permalink / raw)
  To: alexey.kodanev; +Cc: netdev
In-Reply-To: <1524141752-25789-1-git-send-email-alexey.kodanev@oracle.com>

From: Alexey Kodanev <alexey.kodanev@oracle.com>
Date: Thu, 19 Apr 2018 15:42:28 +0300

> The first two patches don't introduce any functional changes and
> contain minor cleanups for code readability.
> 
> The last one adds a new function geneve_link_config() similar to the
> other tunnels. The function will be used on a new link creation or
> when 'remote' parameter is changed. It adjusts a user specified MTU
> or, if it finds a lower device, tunes the tunnel MTU using it.

Ok, series applied, thanks.

^ permalink raw reply

* [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:18 UTC (permalink / raw)
  To: eric.dumazet, alexei.starovoitov, songliubraving, davem
  Cc: netdev, linux-kernel, Yafang Shao

tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.

When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.

Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.

Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

---
v2 -> v3: use sock_gen_cookie in tcp_event_sk as well.
          Maybe we could init sk_cookie in the stack then in other code
          path we just read it other than set it. If that is needed I
          will do it in another new patch.
v1 -> v2: use sk_cookie as a key suggested by Eric.
---
 include/trace/events/tcp.h | 30 +++++++++++++++++++++---------
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3dd6802..c1a5284 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -10,6 +10,7 @@
 #include <linux/tracepoint.h>
 #include <net/ipv6.h>
 #include <net/tcp.h>
+#include <linux/sock_diag.h>
 
 #define TP_STORE_V4MAPPED(__entry, saddr, daddr)		\
 	do {							\
@@ -113,7 +114,7 @@
  */
 DECLARE_EVENT_CLASS(tcp_event_sk,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk),
 
@@ -125,6 +126,7 @@
 		__array(__u8, daddr, 4)
 		__array(__u8, saddr_v6, 16)
 		__array(__u8, daddr_v6, 16)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -144,24 +146,34 @@
 
 		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
 			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c",
+	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llx",
 		  __entry->sport, __entry->dport,
 		  __entry->saddr, __entry->daddr,
-		  __entry->saddr_v6, __entry->daddr_v6)
+		  __entry->saddr_v6, __entry->daddr_v6,
+		  __entry->sock_cookie)
 );
 
 DEFINE_EVENT(tcp_event_sk, tcp_receive_reset,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk)
 );
 
 DEFINE_EVENT(tcp_event_sk, tcp_destroy_sock,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
+
+	TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
+
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk)
 );
@@ -232,6 +244,7 @@
 		__field(__u32, snd_wnd)
 		__field(__u32, srtt)
 		__field(__u32, rcv_wnd)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -256,15 +269,14 @@
 		__entry->rcv_wnd = tp->rcv_wnd;
 		__entry->ssthresh = tcp_current_ssthresh(sk);
 		__entry->srtt = tp->srtt_us >> 3;
+		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
-		  "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
-		  "rcv_wnd=%u",
+	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u sock_cookie=%llx",
 		  __entry->saddr, __entry->daddr, __entry->mark,
 		  __entry->length, __entry->snd_nxt, __entry->snd_una,
 		  __entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
-		  __entry->srtt, __entry->rcv_wnd)
+		  __entry->srtt, __entry->rcv_wnd, __entry->sock_cookie)
 );
 
 #endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f..43ad468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,6 +582,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	u32 copied;
 	int time;
 
+	trace_tcp_rcv_space_adjust(sk);
+
 	tcp_mstamp_refresh(tp);
 	time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
 	if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v3] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:14 UTC (permalink / raw)
  To: eric.dumazet, alexei.starovoitov, songliubraving, davem
  Cc: netdev, linux-kernel, Yafang Shao

tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.

When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.

Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.

Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

---
v2 -> v3: use sock_gen_cookie in tcp_event_sk as well.
          Maybe we could init sk_cookie in the stack then in other code
          path we just read it other than set it. If that is needed I
          will do it in another new patch.
v1 -> v2: use sk_cookie as a key suggested by Eric.

---
 include/trace/events/tcp.h | 30 +++++++++++++++++++++---------
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3dd6802..c820bf6 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -10,6 +10,7 @@
 #include <linux/tracepoint.h>
 #include <net/ipv6.h>
 #include <net/tcp.h>
+#include <linux/sock_diag.h>
 
 #define TP_STORE_V4MAPPED(__entry, saddr, daddr)		\
 	do {							\
@@ -113,7 +114,7 @@
  */
 DECLARE_EVENT_CLASS(tcp_event_sk,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk),
 
@@ -125,6 +126,7 @@
 		__array(__u8, daddr, 4)
 		__array(__u8, saddr_v6, 16)
 		__array(__u8, daddr_v6, 16)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -144,24 +146,34 @@
 
 		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
 			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c",
+	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llu",
 		  __entry->sport, __entry->dport,
 		  __entry->saddr, __entry->daddr,
-		  __entry->saddr_v6, __entry->daddr_v6)
+		  __entry->saddr_v6, __entry->daddr_v6,
+		  __entry->sock_cookie)
 );
 
 DEFINE_EVENT(tcp_event_sk, tcp_receive_reset,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk)
 );
 
 DEFINE_EVENT(tcp_event_sk, tcp_destroy_sock,
 
-	TP_PROTO(const struct sock *sk),
+	TP_PROTO(struct sock *sk),
+
+	TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
+
+	TP_PROTO(struct sock *sk),
 
 	TP_ARGS(sk)
 );
@@ -232,6 +244,7 @@
 		__field(__u32, snd_wnd)
 		__field(__u32, srtt)
 		__field(__u32, rcv_wnd)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -256,15 +269,14 @@
 		__entry->rcv_wnd = tp->rcv_wnd;
 		__entry->ssthresh = tcp_current_ssthresh(sk);
 		__entry->srtt = tp->srtt_us >> 3;
+		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
-		  "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
-		  "rcv_wnd=%u",
+	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u sock_cookie=%llx",
 		  __entry->saddr, __entry->daddr, __entry->mark,
 		  __entry->length, __entry->snd_nxt, __entry->snd_una,
 		  __entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
-		  __entry->srtt, __entry->rcv_wnd)
+		  __entry->srtt, __entry->rcv_wnd, __entry->sock_cookie)
 );
 
 #endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f..43ad468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,6 +582,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	u32 copied;
 	int time;
 
+	trace_tcp_rcv_space_adjust(sk);
+
 	tcp_mstamp_refresh(tp);
 	time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
 	if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next v3] team: account for oper state
From: David Miller @ 2018-04-20 15:13 UTC (permalink / raw)
  To: gwilkie; +Cc: jiri, netdev
In-Reply-To: <20180419103414.542-1-gwilkie@vyatta.att-mail.com>

From: George Wilkie <gwilkie@vyatta.att-mail.com>
Date: Thu, 19 Apr 2018 11:34:14 +0100

> Account for operational state when determining port linkup state,
> as per Documentation/networking/operstates.txt.
> 
> Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>

Applied.

^ permalink raw reply

* Re: [net-next 0/3] tipc: Confgiuration of MTU for media UDP
From: David Miller @ 2018-04-20 15:04 UTC (permalink / raw)
  To: mohan.krishna.ghanta.krishnamurthy
  Cc: tipc-discussion, jon.maloy, maloy, ying.xue, netdev
In-Reply-To: <1524128780-2550-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

From: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Date: Thu, 19 Apr 2018 11:06:17 +0200

> Systematic measurements have shown that an emulated MTU of 14k for
> UDP bearers is the optimal value for maximal throughput. Accordingly,
> the default MTU of UDP bearers is changed to 14k.
> 
> We also provide users with a fallback option from this value,
> by providing support to configure MTU for UDP bearers. The following
> options are introduced which are symmetrical to the design of
> confguring link tolerance.
> 
> - Configure media with new MTU value, which will take effect on
> links going up after the moment it was configured. Alternatively,
> the bearer has to be disabled and re-enabled, for existing links to
> reflect the configured value.
> 
> - Configure bearer with new MTU value, which take effect on 
> running links dynamically.
> 
> Please note:
> - User has to change MTU at both endpoints, otherwise the link 
> will fall back to smallest MTU after a reset.
> - Failover from a link with higher MTU to a link with lower MTU

There are many negatives to using UDP in a way which causes
fragmentation, like this code now does.

But whatever, you guys can do whatever you want and get to keep the
pieces I guess :-)

Series applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox