* Re: [PATCH net-next] net/ipv6: Fix ip6_convert_metrics() bug
From: David Miller @ 2018-04-20 15:36 UTC (permalink / raw)
To: edumazet; +Cc: netdev, eric.dumazet, dsa
In-Reply-To: <20180419161453.46977-1-edumazet@google.com>
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 19 Apr 2018 09:14:53 -0700
> If ip6_convert_metrics() fails to allocate memory, it should not
> overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
...
> Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-20 15:34 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, jiri
In-Reply-To: <d1bf0771-751b-85bc-4890-8232283131a2@intel.com>
On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
> > > + finfo = netdev_priv(failover_dev);
> > > +
> > > + primary_dev = rtnl_dereference(finfo->primary_dev);
> > > + standby_dev = rtnl_dereference(finfo->standby_dev);
> > > +
> > > + if (slave_dev != primary_dev && slave_dev != standby_dev)
> > > + goto done;
> > > +
> > > + if ((primary_dev && failover_xmit_ready(primary_dev)) ||
> > > + (standby_dev && failover_xmit_ready(standby_dev))) {
> > > + netif_carrier_on(failover_dev);
> > > + netif_tx_wake_all_queues(failover_dev);
> > > + } else {
> > > + netif_carrier_off(failover_dev);
> > > + netif_tx_stop_all_queues(failover_dev);
> > And I think it's a good idea to get stats from device here too.
>
> Not sure why we need to get stats from lower devs here?
link down is often indication of a hardware problem.
lower dev might stop responding down the road.
> > > +static const struct net_device_ops failover_dev_ops = {
> > > + .ndo_open = failover_open,
> > > + .ndo_stop = failover_close,
> > > + .ndo_start_xmit = failover_start_xmit,
> > > + .ndo_select_queue = failover_select_queue,
> > > + .ndo_get_stats64 = failover_get_stats,
> > > + .ndo_change_mtu = failover_change_mtu,
> > > + .ndo_set_rx_mode = failover_set_rx_mode,
> > > + .ndo_validate_addr = eth_validate_addr,
> > > + .ndo_features_check = passthru_features_check,
> > xdp support?
>
> I think it should be possible to add it be calling the lower dev ndo_xdp routines
> with proper checks. can we add this later?
I'd be concerned that if you don't xdp userspace will keep poking
at lower devs. Then it will stop working if you add this later.
--
MST
^ permalink raw reply
* Re: [PATCH net-next v2 2/4] net/smc: handle sockopt TCP_NODELAY
From: David Miller @ 2018-04-20 15:31 UTC (permalink / raw)
To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180419135655.3058-3-ubraun@linux.ibm.com>
From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:53 +0200
> @@ -1412,6 +1523,10 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
> sk_common_release(sk);
> goto out;
> }
> + /* clc handshake should run with disabled Nagle algorithm */
> + rc = kernel_setsockopt(smc->clcsock, SOL_TCP, TCP_NODELAY, (char *)&val,
> + sizeof(val));
> + smc->deferred_nodelay_reset = 1; /* TCP_NODELAY is not the default */
> smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
> smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
This is not what I asked for.
If the user asked for the socket option, you are unconditionally returning success
to the original user.
If it fails here during the smc create, it's too late to handle the error properly.
This is the problem you must solve before I can take these changes properly.
You aren't even really failing the smc_create() here, because if you were you
would kill and free up the clcsock and release 'sk'.
As it stands now you are returning an error, and not releasing resources, if
the kernel_setsockopt() fails.
But more fundamentally these semantics are terrible. You must not ever create
a situation where you tell the user his setsockopt succeeded by in the end
not honoring that reqeust fully. That is what your current changes allow to
happen.
^ permalink raw reply
* Re: [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Michael S. Tsirkin @ 2018-04-20 15:28 UTC (permalink / raw)
To: Alexander Duyck
Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0Ucpw9D1u8MYoEX5KOij6sxs-5O9gPFTmOVq=Dgt=HUJ=g@mail.gmail.com>
On Fri, Apr 20, 2018 at 07:56:14AM -0700, Alexander Duyck wrote:
> > I think for virtio it should include the feature bit, yes.
> > Adding feature bit is very easy - post a patch to the virtio TC mailing
> > list, wait about a week to give people time to respond (two weeks if it
> > is around holidays and such).
>
> The problem is we are talking about hardware/FPGA, not software.
> Adding a feature bit means going back and updating RTL. The software
> side of things is easy, re-validating things after a hardware/FPGA
> change not so much.
>
> If this is a hard requirement I may just drop the virtio patch, push
> what I have, and leave it to Mark/Dan to deal with the necessary RTL
> and code changes needed to support Virtio as I don't expect the
> turnaround to be as easy as just a patch.
>
> Thanks.
>
> - Alex
Let's focus on virtio in this thread.
Involving the virtio TC in host/guest interface changes is a
hard requirement. It's just too easy to create conflicts otherwise.
So you guys should have just sent the proposal to the TC when you
were doing your RTL and you would have been in the clear.
Generally adding a feature bit with any extension is a good idea:
this way you merely reserve a feature bit for your feature through
the TC and are more or less sure of forward and backward compatibility.
It's incredibly easy.
But maybe it's not needed here. I am not making the decisions myself.
Not too late: post to the TC list and let's see what the response is.
Without a feature bit you are making a change affecting all future
implementations without exception so the bar is a bit higher: you need
to actually post a spec text proposal not just a patch showing how to
use the feature, and TC needs to vote on it. Voting takes a week,
review a week or two depending on change complexity.
Hope this helps,
--
MST
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-20 15:28 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: mst, davem, netdev, virtualization, virtio-dev, jesse.brandeburg,
alexander.h.duyck, kubakici, jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-5-git-send-email-sridhar.samudrala@intel.com>
On Thu, 19 Apr 2018 18:42:04 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> Use the registration/notification framework supported by the generic
> failover infrastructure.
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Do what you want to other devices but leave netvsc alone.
Adding these failover ops does not reduce the code size, and really is
no benefit. The netvsc device driver needs to be backported to several
other distributions and doing this makes that harder.
I will NAK patches to change to common code for netvsc especially the
three device model. MS worked hard with distro vendors to support transparent
mode, ans we really can't have a new model; or do backport.
Plus, DPDK is now dependent on existing model.
^ permalink raw reply
* Re: [PATCH net-next 0/2] qed* : Use trust mode to override forced MAC
From: David Miller @ 2018-04-20 15:26 UTC (permalink / raw)
To: shahed.shaikh; +Cc: netdev, Ariel.Elior, Dept-EngEverestLinuxL2
In-Reply-To: <20180419125012.20503-1-shahed.shaikh@cavium.com>
From: Shahed Shaikh <shahed.shaikh@cavium.com>
Date: Thu, 19 Apr 2018 05:50:10 -0700
> This patchset adds a support to override forced MAC (MAC set by PF for a VF)
> when trust mode is enabled using
> #ip link set dev <pf> vf <vf id> trust on
>
> First patch adds a real change to use .ndo_set_vf_trust to override forced MAC
> and allow user to change VFs from VF interface itself.
>
> Second patch takes care of a corner case, where MAC change from VF won't
> take effect when VF interface is down, by introducing a new TLV
> (a way to send message from VF to PF) to give a hint to PF to update
> its bulletin board.
>
> Please apply this series to net-next.
Series applied.
^ permalink raw reply
* Re: [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:24 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, Alexei Starovoitov, Song Liu, netdev, LKML
In-Reply-To: <20180420.112106.514748261967848944.davem@davemloft.net>
On Fri, Apr 20, 2018 at 11:21 PM, David Miller <davem@davemloft.net> wrote:
>
> Why are you sending this same patch twice?
>
> Thank you.
Some mistake.
Sorry about that.
Pls. use the second patch.
Thanks
Yafang
^ permalink raw reply
* Re: [net-next 1/3] tipc: set default MTU for UDP media
From: kbuild test robot @ 2018-04-20 15:22 UTC (permalink / raw)
To: GhantaKrishnamurthy MohanKrishna
Cc: kbuild-all, tipc-discussion, jon.maloy, maloy, ying.xue,
mohan.krishna.ghanta.krishnamurthy, netdev, davem
In-Reply-To: <1524128780-2550-2-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>
[-- Attachment #1: Type: text/plain, Size: 5389 bytes --]
Hi GhantaKrishnamurthy,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on net-next/master]
url: https://github.com/0day-ci/linux/commits/GhantaKrishnamurthy-MohanKrishna/tipc-Confgiuration-of-MTU-for-media-UDP/20180420-224412
config: x86_64-randconfig-x018-201815 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
Note: the linux-review/GhantaKrishnamurthy-MohanKrishna/tipc-Confgiuration-of-MTU-for-media-UDP/20180420-224412 HEAD 5757244a45c9114ee8a7ed60e9b074107605f6eb builds fine.
It only hurts bisectibility.
All errors (new ones prefixed by >>):
net//tipc/udp_media.c: In function 'tipc_udp_enable':
>> net//tipc/udp_media.c:716:20: error: 'struct tipc_media' has no member named 'mtu'
b->mtu = b->media->mtu;
^~
net//tipc/udp_media.c: At top level:
net//tipc/udp_media.c:805:3: error: 'struct tipc_media' has no member named 'mtu'
.mtu = TIPC_DEF_LINK_UDP_MTU,
^~~
vim +716 net//tipc/udp_media.c
632
633 /**
634 * tipc_udp_enable - callback to create a new udp bearer instance
635 * @net: network namespace
636 * @b: pointer to generic tipc_bearer
637 * @attrs: netlink bearer configuration
638 *
639 * validate the bearer parameters and initialize the udp bearer
640 * rtnl_lock should be held
641 */
642 static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
643 struct nlattr *attrs[])
644 {
645 int err = -EINVAL;
646 struct udp_bearer *ub;
647 struct udp_media_addr remote = {0};
648 struct udp_media_addr local = {0};
649 struct udp_port_cfg udp_conf = {0};
650 struct udp_tunnel_sock_cfg tuncfg = {NULL};
651 struct nlattr *opts[TIPC_NLA_UDP_MAX + 1];
652 u8 node_id[NODE_ID_LEN] = {0,};
653
654 ub = kzalloc(sizeof(*ub), GFP_ATOMIC);
655 if (!ub)
656 return -ENOMEM;
657
658 INIT_LIST_HEAD(&ub->rcast.list);
659
660 if (!attrs[TIPC_NLA_BEARER_UDP_OPTS])
661 goto err;
662
663 if (nla_parse_nested(opts, TIPC_NLA_UDP_MAX,
664 attrs[TIPC_NLA_BEARER_UDP_OPTS],
665 tipc_nl_udp_policy, NULL))
666 goto err;
667
668 if (!opts[TIPC_NLA_UDP_LOCAL] || !opts[TIPC_NLA_UDP_REMOTE]) {
669 pr_err("Invalid UDP bearer configuration");
670 err = -EINVAL;
671 goto err;
672 }
673
674 err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_LOCAL], &local,
675 &ub->ifindex);
676 if (err)
677 goto err;
678
679 err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_REMOTE], &remote, NULL);
680 if (err)
681 goto err;
682
683 /* Autoconfigure own node identity if needed */
684 if (!tipc_own_id(net)) {
685 memcpy(node_id, local.ipv6.in6_u.u6_addr8, 16);
686 tipc_net_init(net, node_id, 0);
687 }
688 if (!tipc_own_id(net)) {
689 pr_warn("Failed to set node id, please configure manually\n");
690 err = -EINVAL;
691 goto err;
692 }
693
694 b->bcast_addr.media_id = TIPC_MEDIA_TYPE_UDP;
695 b->bcast_addr.broadcast = TIPC_BROADCAST_SUPPORT;
696 rcu_assign_pointer(b->media_ptr, ub);
697 rcu_assign_pointer(ub->bearer, b);
698 tipc_udp_media_addr_set(&b->addr, &local);
699 if (local.proto == htons(ETH_P_IP)) {
700 struct net_device *dev;
701
702 dev = __ip_dev_find(net, local.ipv4.s_addr, false);
703 if (!dev) {
704 err = -ENODEV;
705 goto err;
706 }
707 udp_conf.family = AF_INET;
708 udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
709 udp_conf.use_udp_checksums = false;
710 ub->ifindex = dev->ifindex;
711 if (tipc_mtu_bad(dev, sizeof(struct iphdr) +
712 sizeof(struct udphdr))) {
713 err = -EINVAL;
714 goto err;
715 }
> 716 b->mtu = b->media->mtu;
717 #if IS_ENABLED(CONFIG_IPV6)
718 } else if (local.proto == htons(ETH_P_IPV6)) {
719 udp_conf.family = AF_INET6;
720 udp_conf.use_udp6_tx_checksums = true;
721 udp_conf.use_udp6_rx_checksums = true;
722 udp_conf.local_ip6 = in6addr_any;
723 b->mtu = 1280;
724 #endif
725 } else {
726 err = -EAFNOSUPPORT;
727 goto err;
728 }
729 udp_conf.local_udp_port = local.port;
730 err = udp_sock_create(net, &udp_conf, &ub->ubsock);
731 if (err)
732 goto err;
733 tuncfg.sk_user_data = ub;
734 tuncfg.encap_type = 1;
735 tuncfg.encap_rcv = tipc_udp_recv;
736 tuncfg.encap_destroy = NULL;
737 setup_udp_tunnel_sock(net, ub->ubsock, &tuncfg);
738
739 /**
740 * The bcast media address port is used for all peers and the ip
741 * is used if it's a multicast address.
742 */
743 memcpy(&b->bcast_addr.value, &remote, sizeof(remote));
744 if (tipc_udp_is_mcast_addr(&remote))
745 err = enable_mcast(ub, &remote);
746 else
747 err = tipc_udp_rcast_add(b, &remote);
748 if (err)
749 goto err;
750
751 return 0;
752 err:
753 if (ub->ubsock)
754 udp_tunnel_sock_release(ub->ubsock);
755 kfree(ub);
756 return err;
757 }
758
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27623 bytes --]
^ permalink raw reply
* Re: [PATCH 4/8] sh_eth: Change platform check to CONFIG_ARCH_RENESAS
From: Sergei Shtylyov @ 2018-04-20 15:22 UTC (permalink / raw)
To: Geert Uytterhoeven, Simon Horman, Magnus Damm, Russell King,
Catalin Marinas, Will Deacon, Dan Williams, Vinod Koul,
Mauro Carvalho Chehab, David S . Miller, Greg Kroah-Hartman,
Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
Arnd Bergmann, Kuninori Morimoto, Laurent Pinchart
Cc: devel, alsa-devel, netdev, linux-kernel, linux-renesas-soc,
dmaengine, linux-arm-kernel, linux-media
In-Reply-To: <1524230914-10175-5-git-send-email-geert+renesas@glider.be>
On 04/20/2018 04:28 PM, Geert Uytterhoeven wrote:
> Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS")
> is CONFIG_ARCH_RENESAS a more appropriate platform check than the legacy
> CONFIG_ARCH_SHMOBILE, hence use the former.
>
> Renesas SuperH SH-Mobile SoCs are still covered by the CONFIG_CPU_SH4
> check.
>
> This will allow to drop ARCH_SHMOBILE on ARM and ARM64 in the near
> future.
>
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
[...]
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
MBR, Sergei
^ permalink raw reply
* Re: [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: David Miller @ 2018-04-20 15:21 UTC (permalink / raw)
To: laoar.shao
Cc: eric.dumazet, alexei.starovoitov, songliubraving, netdev,
linux-kernel
In-Reply-To: <1524237506-11011-1-git-send-email-laoar.shao@gmail.com>
Why are you sending this same patch twice?
Thank you.
^ permalink raw reply
* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Samudrala, Sridhar @ 2018-04-20 15:21 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, jiri
In-Reply-To: <20180420045657-mutt-send-email-mst@kernel.org>
On 4/19/2018 7:44 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each instance
>> of netvsc as a 'failover' instance along with a set of ops to manage the
>> slave events.
>> failover_register()
>> failover_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the failover module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>> failover_create()
>> failover_destroy()
>> These functions call failover_register()/failover_unregister() with the
>> master netdev created by the failover module.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> I like this patch. Yes something to improve (see below)
>
>> ---
>> include/linux/netdevice.h | 16 +
>> include/net/failover.h | 96 ++++++
>> net/Kconfig | 18 +
>> net/core/Makefile | 1 +
>> net/core/failover.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 975 insertions(+)
>> create mode 100644 include/net/failover.h
>> create mode 100644 net/core/failover.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..ed535b6724e1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1401,6 +1401,8 @@ struct net_device_ops {
>> * entity (i.e. the master device for bridged veth)
>> * @IFF_MACSEC: device is a MACsec device
>> * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>> + * @IFF_FAILOVER: device is a failover master device
>> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>> */
>> enum netdev_priv_flags {
>> IFF_802_1Q_VLAN = 1<<0,
>> @@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
>> IFF_PHONY_HEADROOM = 1<<24,
>> IFF_MACSEC = 1<<25,
>> IFF_NO_RX_HANDLER = 1<<26,
>> + IFF_FAILOVER = 1<<27,
>> + IFF_FAILOVER_SLAVE = 1<<28,
>> };
>>
>> #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> @@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC IFF_MACSEC
>> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>> +#define IFF_FAILOVER IFF_FAILOVER
>> +#define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
>>
>> /**
>> * struct net_device - The DEVICE structure.
>> @@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_failover(const struct net_device *dev)
>> +{
>> + return dev->priv_flags & IFF_FAILOVER;
>> +}
>> +
>> +static inline bool netif_is_failover_slave(const struct net_device *dev)
>> +{
>> + return dev->priv_flags & IFF_FAILOVER_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/failover.h b/include/net/failover.h
>> new file mode 100644
>> index 000000000000..0b8601043d90
>> --- /dev/null
>> +++ b/include/net/failover.h
>> @@ -0,0 +1,96 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_FAILOVER_H
>> +#define _NET_FAILOVER_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct failover_ops {
>> + int (*slave_pre_register)(struct net_device *slave_dev,
>> + struct net_device *failover_dev);
>> + int (*slave_join)(struct net_device *slave_dev,
>> + struct net_device *failover_dev);
>> + int (*slave_pre_unregister)(struct net_device *slave_dev,
>> + struct net_device *failover_dev);
>> + int (*slave_release)(struct net_device *slave_dev,
>> + struct net_device *failover_dev);
>> + int (*slave_link_change)(struct net_device *slave_dev,
>> + struct net_device *failover_dev);
>> + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct failover {
>> + struct list_head list;
>> + struct net_device __rcu *failover_dev;
>> + struct failover_ops __rcu *ops;
>> +};
>> +
>> +/* failover state */
>> +struct failover_info {
>> + /* primary netdev with same MAC */
>> + struct net_device __rcu *primary_dev;
>> +
>> + /* standby netdev */
>> + struct net_device __rcu *standby_dev;
>> +
>> + /* primary netdev stats */
>> + struct rtnl_link_stats64 primary_stats;
>> +
>> + /* standby netdev stats */
>> + struct rtnl_link_stats64 standby_stats;
>> +
>> + /* aggregated stats */
>> + struct rtnl_link_stats64 failover_stats;
>> +
>> + /* spinlock while updating stats */
>> + spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_FAILOVER)
>> +
>> +int failover_create(struct net_device *standby_dev,
>> + struct failover **pfailover);
>> +void failover_destroy(struct failover *failover);
>> +
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> + struct failover **pfailover);
>> +void failover_unregister(struct failover *failover);
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev);
>> +
>> +#else
>> +
>> +static inline
>> +int failover_create(struct net_device *standby_dev,
>> + struct failover **pfailover);
>> +{
>> + return 0;
>> +}
>> +
>> +static inline
>> +void failover_destroy(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
>> + struct pfailover **pfailover);
>> +{
>> + return 0;
>> +}
>> +
>> +static inline
>> +void failover_unregister(struct failover *failover)
>> +{
>> +}
>> +
>> +static inline
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> + return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_FAILOVER_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..388b99dfee10 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_FAILOVER
>> + tristate "Failover interface"
>> + help
>> + This provides a generic interface for paravirtual drivers to listen
>> + for netdev register/unregister/link change events from pci ethernet
>> + devices with the same MAC and takeover their datapath. This also
>> + enables live migration of a VM with direct attached VF by failing
>> + over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_FAILOVER
>> + tristate
>> + default m if NET_FAILOVER=m
>> + default y if NET_FAILOVER=y || NET_FAILOVER=n
>> + help
>> + Drivers using the failover infrastructure should have a dependency
>> + on MAY_USE_FAILOVER to ensure they do not cause link errors when
>> + failover is a loadable module and the driver using it is built-in.
>> +
>> endif # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..cef17518bb7d 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_FAILOVER) += failover.o
>> diff --git a/net/core/failover.c b/net/core/failover.c
>> new file mode 100644
>> index 000000000000..7bee762cb737
>> --- /dev/null
>> +++ b/net/core/failover.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>
> I think you should copy the header from bond_main.c upon which
> some of the code seems to be based.
Yes. some of the code is based on bonding/team and netvsc. i added a reference
to netvsc in the comment, will also include bonding/team driver.
>
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/failover.h>
>> +
>> +static LIST_HEAD(failover_list);
>> +static DEFINE_SPINLOCK(failover_lock);
>> +
>> +static int failover_slave_pre_register(struct net_device *slave_dev,
>> + struct net_device *failover_dev,
>> + struct failover_ops *failover_ops)
>> +{
>> + struct failover_info *finfo;
>> + bool standby;
>> +
>> + if (failover_ops) {
>> + if (!failover_ops->slave_pre_register)
>> + return -EINVAL;
>> +
>> + return failover_ops->slave_pre_register(slave_dev,
>> + failover_dev);
>> + }
>> +
>> + finfo = netdev_priv(failover_dev);
>> + standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> + if (standby ? rtnl_dereference(finfo->standby_dev) :
>> + rtnl_dereference(finfo->primary_dev)) {
>> + netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
>> + slave_dev->name, standby ? "standby" : "primary");
>> + return -EEXIST;
>> + }
>> +
>> + /* Avoid non pci devices as primary netdev */
> Why? Pls change this comment so it explains the motivation
> rather than just repeat what the code does.
OK.
>
>> + if (!standby && (!slave_dev->dev.parent ||
>> + !dev_is_pci(slave_dev->dev.parent)))
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int failover_slave_join(struct net_device *slave_dev,
>> + struct net_device *failover_dev,
>> + struct failover_ops *failover_ops)
>> +{
>> + struct failover_info *finfo;
>> + int err, orig_mtu;
>> + bool standby;
>> +
>> + if (failover_ops) {
>> + if (!failover_ops->slave_join)
>> + return -EINVAL;
>> +
>> + return failover_ops->slave_join(slave_dev, failover_dev);
>> + }
>> +
>> + if (netif_running(failover_dev)) {
>> + err = dev_open(slave_dev);
>> + if (err && (err != -EBUSY)) {
>> + netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
>> + slave_dev->name, err);
>> + goto err_dev_open;
>> + }
>> + }
>> +
>> + /* Align MTU of slave with failover dev */
>> + orig_mtu = slave_dev->mtu;
> I suspect this was copied from bond. this variable is never
> used and I'm even surprised gcc did not warn about this.
Thanks for catching, I broke this when i moved the dev_open() and dev_set_mtu()
calls from register to join. I need to reset slave_dev mtu to orig_mtu on failure.
>
>
>> + err = dev_set_mtu(slave_dev, failover_dev->mtu);
> How do we know slave supports this MTU? same applies to
> failover_change_mtu.
The err check below should catch it and we will reset the mtu back and
fail the join/register.
>
>
>
>
>> + if (err) {
>> + netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
>> + slave_dev->name, failover_dev->mtu);
>> + goto err_set_mtu;
>> + }
>> +
>> + finfo = netdev_priv(failover_dev);
>> + standby = (slave_dev->dev.parent == failover_dev->dev.parent);
>> +
>> + dev_hold(slave_dev);
>> +
>> + if (standby) {
>> + rcu_assign_pointer(finfo->standby_dev, slave_dev);
>> + dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
>> + } else {
>> + rcu_assign_pointer(finfo->primary_dev, slave_dev);
>> + dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
>> + failover_dev->min_mtu = slave_dev->min_mtu;
>> + failover_dev->max_mtu = slave_dev->max_mtu;
>> + }
>> +
>> + netdev_info(failover_dev, "failover slave:%s joined\n",
>> + slave_dev->name);
>> +
>> + return 0;
>> +
>> +err_set_mtu:
>> + dev_close(slave_dev);
>> +err_dev_open:
>> + return err;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
>> +{
>> + struct sk_buff *skb = *pskb;
>> + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> + skb->dev = ndev;
>> +
>> + return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
>> +{
>> + struct net_device *failover_dev;
>> + struct failover *failover;
>> +
>> + spin_lock(&failover_lock);
>> + list_for_each_entry(failover, &failover_list, list) {
>> + failover_dev = rtnl_dereference(failover->failover_dev);
>> + if (ether_addr_equal(failover_dev->perm_addr, mac)) {
>> + *ops = rtnl_dereference(failover->ops);
>> + spin_unlock(&failover_lock);
>> + return failover_dev;
>> + }
>> + }
>> + spin_unlock(&failover_lock);
>> + return NULL;
>> +}
>> +
>> +static int failover_slave_register(struct net_device *slave_dev)
>> +{
>> + struct failover_ops *failover_ops;
>> + struct net_device *failover_dev;
>> + int ret;
>> +
>> + ASSERT_RTNL();
>> +
>> + failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> + if (!failover_dev)
>> + goto done;
>> +
>> + ret = failover_slave_pre_register(slave_dev, failover_dev,
>> + failover_ops);
>> + if (ret)
>> + goto done;
>> +
>> + ret = netdev_rx_handler_register(slave_dev, failover_ops ?
>> + failover_ops->handle_frame :
>> + failover_handle_frame, failover_dev);
>> + if (ret) {
>> + netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>> + ret);
>> + goto done;
>> + }
>> +
>> + ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> + if (ret) {
>> + netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
>> + failover_dev->name, ret);
>> + goto upper_link_failed;
>> + }
>> +
>> + slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
>> +
>> + ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
>> + if (ret)
>> + goto err_join;
>> +
>> + call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
>> +
>> + netdev_info(failover_dev, "failover slave:%s registered\n",
>> + slave_dev->name);
>> +
>> + goto done;
>> +
>> +err_join:
>> + netdev_upper_dev_unlink(slave_dev, failover_dev);
>> + slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +upper_link_failed:
>> + netdev_rx_handler_unregister(slave_dev);
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static int failover_slave_pre_unregister(struct net_device *slave_dev,
>> + struct net_device *failover_dev,
>> + struct failover_ops *failover_ops)
>> +{
>> + struct net_device *standby_dev, *primary_dev;
>> + struct failover_info *finfo;
>> +
>> + if (failover_ops) {
>> + if (!failover_ops->slave_pre_unregister)
>> + return -EINVAL;
>> +
>> + return failover_ops->slave_pre_unregister(slave_dev,
>> + failover_dev);
>> + }
>> +
>> + finfo = netdev_priv(failover_dev);
>> + primary_dev = rtnl_dereference(finfo->primary_dev);
>> + standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> + if (slave_dev != primary_dev && slave_dev != standby_dev)
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int failover_slave_release(struct net_device *slave_dev,
>> + struct net_device *failover_dev,
>> + struct failover_ops *failover_ops)
>> +{
>> + struct net_device *standby_dev, *primary_dev;
>> + struct failover_info *finfo;
>> +
>> + if (failover_ops) {
>> + if (!failover_ops->slave_release)
>> + return -EINVAL;
>> +
>> + return failover_ops->slave_release(slave_dev, failover_dev);
>> + }
>> +
>> + finfo = netdev_priv(failover_dev);
>> + primary_dev = rtnl_dereference(finfo->primary_dev);
>> + standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> + if (slave_dev == standby_dev) {
>> + RCU_INIT_POINTER(finfo->standby_dev, NULL);
>> + } else {
>> + RCU_INIT_POINTER(finfo->primary_dev, NULL);
>> + if (standby_dev) {
>> + failover_dev->min_mtu = standby_dev->min_mtu;
>> + failover_dev->max_mtu = standby_dev->max_mtu;
>> + }
>> + }
>> +
>> + dev_put(slave_dev);
>> +
>> + netdev_info(failover_dev, "failover slave:%s released\n",
>> + slave_dev->name);
>> +
>> + return 0;
>> +}
>> +
>> +int failover_slave_unregister(struct net_device *slave_dev)
>> +{
>> + struct failover_ops *failover_ops;
>> + struct net_device *failover_dev;
>> + int ret;
>> +
>> + if (!netif_is_failover_slave(slave_dev))
>> + goto done;
>> +
>> + ASSERT_RTNL();
>> +
>> + failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> + if (!failover_dev)
>> + goto done;
>> +
>> + ret = failover_slave_pre_unregister(slave_dev, failover_dev,
>> + failover_ops);
>> + if (ret)
>> + goto done;
>> +
>> + netdev_rx_handler_unregister(slave_dev);
>> + netdev_upper_dev_unlink(slave_dev, failover_dev);
>> + slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>> +
>> + failover_slave_release(slave_dev, failover_dev, failover_ops);
>
> Don't you need to get stats from it? This device is going away ...
Yes. we need to update the failover_stats before the slave goes away.
>
>> +
>> + netdev_info(failover_dev, "failover slave:%s unregistered\n",
>> + slave_dev->name);
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_slave_unregister);
>> +
>> +static bool failover_xmit_ready(struct net_device *dev)
>> +{
>> + return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int failover_slave_link_change(struct net_device *slave_dev)
>> +{
>> + struct net_device *failover_dev, *primary_dev, *standby_dev;
>> + struct failover_ops *failover_ops;
>> + struct failover_info *finfo;
>> +
>> + if (!netif_is_failover_slave(slave_dev))
>> + goto done;
>> +
>> + ASSERT_RTNL();
>> +
>> + failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
>> + if (!failover_dev)
>> + goto done;
>> +
>> + if (failover_ops) {
>> + if (!failover_ops->slave_link_change)
>> + goto done;
>> +
>> + return failover_ops->slave_link_change(slave_dev, failover_dev);
>> + }
>> +
>> + if (!netif_running(failover_dev))
>> + return 0;
>> +
>> + finfo = netdev_priv(failover_dev);
>> +
>> + primary_dev = rtnl_dereference(finfo->primary_dev);
>> + standby_dev = rtnl_dereference(finfo->standby_dev);
>> +
>> + if (slave_dev != primary_dev && slave_dev != standby_dev)
>> + goto done;
>> +
>> + if ((primary_dev && failover_xmit_ready(primary_dev)) ||
>> + (standby_dev && failover_xmit_ready(standby_dev))) {
>> + netif_carrier_on(failover_dev);
>> + netif_tx_wake_all_queues(failover_dev);
>> + } else {
>> + netif_carrier_off(failover_dev);
>> + netif_tx_stop_all_queues(failover_dev);
> And I think it's a good idea to get stats from device here too.
Not sure why we need to get stats from lower devs here?
>
>
>> + }
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static bool failover_validate_event_dev(struct net_device *dev)
>> +{
>> + /* Skip parent events */
>> + if (netif_is_failover(dev))
>> + return false;
>> +
>> + /* Avoid non-Ethernet type devices */
> ... for now. It would be possible easy to make this generic -
> just copy things like type and addr_len from slave.
ok. failover_create can copy these values from standby_dev and
we can validate that they match when primary_dev registers.
>
>> + if (dev->type != ARPHRD_ETHER)
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +static int
>> +failover_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> + if (!failover_validate_event_dev(event_dev))
>> + return NOTIFY_DONE;
>> +
>> + switch (event) {
>> + case NETDEV_REGISTER:
>> + return failover_slave_register(event_dev);
>> + case NETDEV_UNREGISTER:
>> + return failover_slave_unregister(event_dev);
>> + case NETDEV_UP:
>> + case NETDEV_DOWN:
>> + case NETDEV_CHANGE:
>> + return failover_slave_link_change(event_dev);
>> + default:
>> + return NOTIFY_DONE;
>> + }
>> +}
>> +
>> +static struct notifier_block failover_notifier = {
>> + .notifier_call = failover_event,
>> +};
>> +
>> +static int failover_open(struct net_device *dev)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *primary_dev, *standby_dev;
>> + int err;
>> +
>> + netif_carrier_off(dev);
>> + netif_tx_wake_all_queues(dev);
>> +
>> + primary_dev = rtnl_dereference(finfo->primary_dev);
>> + if (primary_dev) {
>> + err = dev_open(primary_dev);
>> + if (err)
>> + goto err_primary_open;
>> + }
>> +
>> + standby_dev = rtnl_dereference(finfo->standby_dev);
>> + if (standby_dev) {
>> + err = dev_open(standby_dev);
>> + if (err)
>> + goto err_standby_open;
>> + }
>> +
>> + return 0;
>> +
>> +err_standby_open:
>> + dev_close(primary_dev);
>> +err_primary_open:
>> + netif_tx_disable(dev);
>> + return err;
>> +}
>> +
>> +static int failover_close(struct net_device *dev)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *slave_dev;
>> +
>> + netif_tx_disable(dev);
>> +
>> + slave_dev = rtnl_dereference(finfo->primary_dev);
>> + if (slave_dev)
>> + dev_close(slave_dev);
>> +
>> + slave_dev = rtnl_dereference(finfo->standby_dev);
>> + if (slave_dev)
>> + dev_close(slave_dev);
>> +
>> + return 0;
>> +}
>> +
>> +static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
>> + struct net_device *dev)
>> +{
>> + atomic_long_inc(&dev->tx_dropped);
>> + dev_kfree_skb_any(skb);
>> + return NETDEV_TX_OK;
>> +}
>> +
>> +static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
>> + struct net_device *dev)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *xmit_dev;
>> +
>> + /* Try xmit via primary netdev followed by standby netdev */
>> + xmit_dev = rcu_dereference_bh(finfo->primary_dev);
>> + if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
>> + xmit_dev = rcu_dereference_bh(finfo->standby_dev);
>> + if (!xmit_dev || !failover_xmit_ready(xmit_dev))
>> + return failover_drop_xmit(skb, dev);
>> + }
>> +
>> + skb->dev = xmit_dev;
>> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> + return dev_queue_xmit(skb);
>> +}
> Is this going through qdisc twice? Won't this hurt performance
> measureably?
The failover dev has no queue (IFF_NO_QUEUE) , so doesn't go through qdisc twice.
>
>> +
>> +static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
>> + void *accel_priv,
>> + select_queue_fallback_t fallback)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *primary_dev;
>> + u16 txq;
>> +
>> + rcu_read_lock();
>> + primary_dev = rcu_dereference(finfo->primary_dev);
>> + if (primary_dev) {
>> + const struct net_device_ops *ops = primary_dev->netdev_ops;
>> +
>> + if (ops->ndo_select_queue)
>> + txq = ops->ndo_select_queue(primary_dev, skb,
>> + accel_priv, fallback);
>> + else
>> + txq = fallback(primary_dev, skb);
>> +
>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> + return txq;
>> + }
>> +
>> + txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> + /* Save the original txq to restore before passing to the driver */
>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> + if (unlikely(txq >= dev->real_num_tx_queues)) {
>> + do {
>> + txq -= dev->real_num_tx_queues;
>> + } while (txq >= dev->real_num_tx_queues);
>> + }
>> +
>> + return txq;
>> +}
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32finfot values only.
>> + */
>> +static void failover_fold_stats(struct rtnl_link_stats64 *_res,
>> + const struct rtnl_link_stats64 *_new,
>> + const struct rtnl_link_stats64 *_old)
>> +{
>> + const u64 *new = (const u64 *)_new;
>> + const u64 *old = (const u64 *)_old;
>> + u64 *res = (u64 *)_res;
>> + int i;
>> +
>> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> + u64 nv = new[i];
>> + u64 ov = old[i];
>> + s64 delta = nv - ov;
>> +
>> + /* detects if this particular field is 32bit only */
>> + if (((nv | ov) >> 32) == 0)
>> + delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> + /* filter anomalies, some drivers reset their stats
>> + * at down/up events.
>> + */
>> + if (delta > 0)
>> + res[i] += delta;
>> + }
>> +}
>> +
>> +static void failover_get_stats(struct net_device *dev,
>> + struct rtnl_link_stats64 *stats)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + const struct rtnl_link_stats64 *new;
>> + struct rtnl_link_stats64 temp;
>> + struct net_device *slave_dev;
>> +
>> + spin_lock(&finfo->stats_lock);
>> + memcpy(stats, &finfo->failover_stats, sizeof(*stats));
>> +
>> + rcu_read_lock();
>> +
>> + slave_dev = rcu_dereference(finfo->primary_dev);
>> + if (slave_dev) {
>> + new = dev_get_stats(slave_dev, &temp);
>> + failover_fold_stats(stats, new, &finfo->primary_stats);
>> + memcpy(&finfo->primary_stats, new, sizeof(*new));
>> + }
>> +
>> + slave_dev = rcu_dereference(finfo->standby_dev);
>> + if (slave_dev) {
>> + new = dev_get_stats(slave_dev, &temp);
>> + failover_fold_stats(stats, new, &finfo->standby_stats);
>> + memcpy(&finfo->standby_stats, new, sizeof(*new));
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + memcpy(&finfo->failover_stats, stats, sizeof(*stats));
>> + spin_unlock(&finfo->stats_lock);
>> +}
>> +
>> +static int failover_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *primary_dev, *standby_dev;
>> + int ret = 0;
>> +
>> + primary_dev = rcu_dereference(finfo->primary_dev);
>> + if (primary_dev) {
>> + ret = dev_set_mtu(primary_dev, new_mtu);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + standby_dev = rcu_dereference(finfo->standby_dev);
>> + if (standby_dev) {
>> + ret = dev_set_mtu(standby_dev, new_mtu);
>> + if (ret) {
>> + dev_set_mtu(primary_dev, dev->mtu);
>> + return ret;
>> + }
>> + }
>> +
>> + dev->mtu = new_mtu;
>> + return 0;
>> +}
>> +
>> +static void failover_set_rx_mode(struct net_device *dev)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *slave_dev;
>> +
>> + rcu_read_lock();
>> +
>> + slave_dev = rcu_dereference(finfo->primary_dev);
>> + if (slave_dev) {
>> + dev_uc_sync_multiple(slave_dev, dev);
>> + dev_mc_sync_multiple(slave_dev, dev);
>> + }
>> +
>> + slave_dev = rcu_dereference(finfo->standby_dev);
>> + if (slave_dev) {
>> + dev_uc_sync_multiple(slave_dev, dev);
>> + dev_mc_sync_multiple(slave_dev, dev);
>> + }
>> +
>> + rcu_read_unlock();
>> +}
>> +
>> +static const struct net_device_ops failover_dev_ops = {
>> + .ndo_open = failover_open,
>> + .ndo_stop = failover_close,
>> + .ndo_start_xmit = failover_start_xmit,
>> + .ndo_select_queue = failover_select_queue,
>> + .ndo_get_stats64 = failover_get_stats,
>> + .ndo_change_mtu = failover_change_mtu,
>> + .ndo_set_rx_mode = failover_set_rx_mode,
>> + .ndo_validate_addr = eth_validate_addr,
>> + .ndo_features_check = passthru_features_check,
> xdp support?
I think it should be possible to add it be calling the lower dev ndo_xdp routines
with proper checks. can we add this later?
>
>> +};
>> +
>> +#define FAILOVER_NAME "failover"
>> +#define FAILOVER_VERSION "0.1"
>> +
>> +static void failover_ethtool_get_drvinfo(struct net_device *dev,
>> + struct ethtool_drvinfo *drvinfo)
>> +{
>> + strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
>> + strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int failover_ethtool_get_link_ksettings(struct net_device *dev,
>> + struct ethtool_link_ksettings *cmd)
>> +{
>> + struct failover_info *finfo = netdev_priv(dev);
>> + struct net_device *slave_dev;
>> +
>> + slave_dev = rtnl_dereference(finfo->primary_dev);
>> + if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> + slave_dev = rtnl_dereference(finfo->standby_dev);
>> + if (!slave_dev || !failover_xmit_ready(slave_dev)) {
>> + cmd->base.duplex = DUPLEX_UNKNOWN;
>> + cmd->base.port = PORT_OTHER;
>> + cmd->base.speed = SPEED_UNKNOWN;
>> +
>> + return 0;
>> + }
>> + }
>> +
>> + return __ethtool_get_link_ksettings(slave_dev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops failover_ethtool_ops = {
>> + .get_drvinfo = failover_ethtool_get_drvinfo,
>> + .get_link = ethtool_op_get_link,
>> + .get_link_ksettings = failover_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void failover_register_existing_slave(struct net_device *failover_dev)
>> +{
>> + struct net *net = dev_net(failover_dev);
>> + struct net_device *dev;
>> +
>> + rtnl_lock();
>> + for_each_netdev(net, dev) {
>> + if (dev == failover_dev)
>> + continue;
>> + if (!failover_validate_event_dev(dev))
>> + continue;
>> + if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
>> + failover_slave_register(dev);
>> + }
>> + rtnl_unlock();
>> +}
>> +
>> +int failover_register(struct net_device *dev, struct failover_ops *ops,
>> + struct failover **pfailover)
>> +{
>> + struct failover *failover;
>> +
>> + failover = kzalloc(sizeof(*failover), GFP_KERNEL);
>> + if (!failover)
>> + return -ENOMEM;
>> +
>> + rcu_assign_pointer(failover->ops, ops);
>> + dev_hold(dev);
>> + dev->priv_flags |= IFF_FAILOVER;
>> + rcu_assign_pointer(failover->failover_dev, dev);
>> +
>> + spin_lock(&failover_lock);
>> + list_add_tail(&failover->list, &failover_list);
>> + spin_unlock(&failover_lock);
>> +
>> + failover_register_existing_slave(dev);
>> +
>> + *pfailover = failover;
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_register);
>> +
>> +void failover_unregister(struct failover *failover)
>> +{
>> + struct net_device *failover_dev;
>> +
>> + failover_dev = rcu_dereference(failover->failover_dev);
>> +
>> + failover_dev->priv_flags &= ~IFF_FAILOVER;
>> + dev_put(failover_dev);
>> +
>> + spin_lock(&failover_lock);
>> + list_del(&failover->list);
>> + spin_unlock(&failover_lock);
>> +
>> + kfree(failover);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_unregister);
>> +
>> +int failover_create(struct net_device *standby_dev, struct failover **pfailover)
>> +{
>> + struct device *dev = standby_dev->dev.parent;
>> + struct net_device *failover_dev;
>> + int err;
>> +
>> + /* Alloc at least 2 queues, for now we are going with 16 assuming
>> + * that most devices being bonded won't have too many queues.
>> + */
>> + failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
>> + if (!failover_dev) {
>> + dev_err(dev, "Unable to allocate failover_netdev!\n");
>> + return -ENOMEM;
>> + }
>> +
>> + dev_net_set(failover_dev, dev_net(standby_dev));
>> + SET_NETDEV_DEV(failover_dev, dev);
>> +
>> + failover_dev->netdev_ops = &failover_dev_ops;
>> + failover_dev->ethtool_ops = &failover_ethtool_ops;
>> +
>> + /* Initialize the device options */
>> + failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> + failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> + IFF_TX_SKB_SHARING);
>> +
>> + /* don't acquire failover netdev's netif_tx_lock when transmitting */
>> + failover_dev->features |= NETIF_F_LLTX;
>> +
>> + /* Don't allow failover devices to change network namespaces. */
>> + failover_dev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> + failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> + NETIF_F_HIGHDMA | NETIF_F_LRO;
> OK but then you must make sure your primary and standby both
> support these features.
I guess i need to add something like bond_compute_features() to handle this.
>
>> +
>> + failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> + failover_dev->features |= failover_dev->hw_features;
>> +
>> + memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
>> + failover_dev->addr_len);
>> +
>> + failover_dev->min_mtu = standby_dev->min_mtu;
>> + failover_dev->max_mtu = standby_dev->max_mtu;
> OK MTU is copied, fine. But is this always enough?
>
> How about e.g. hard_header_len? min_header_len? needed_headroom?
> needed_tailroom? I'd worry that even if you cover existing ones more
> might be added with time. A function copying config between devices
> probably belongs in some central place IMHO.
ok. will add a function that handles these fields too.
>
>
>> +
>> + err = register_netdev(failover_dev);
>> + if (err < 0) {
>> + dev_err(dev, "Unable to register failover_dev!\n");
>> + goto err_register_netdev;
>> + }
>> +
>> + netif_carrier_off(failover_dev);
>> +
>> + err = failover_register(failover_dev, NULL, pfailover);
>> + if (err < 0)
>> + goto err_failover;
>> +
>> + return 0;
>> +
>> +err_failover:
>> + unregister_netdev(failover_dev);
>> +err_register_netdev:
>> + free_netdev(failover_dev);
>> +
>> + return err;
>> +}
>> +EXPORT_SYMBOL_GPL(failover_create);
>> +
>> +void failover_destroy(struct failover *failover)
>> +{
>> + struct net_device *failover_dev;
>> + struct net_device *slave_dev;
>> + struct failover_info *finfo;
>> +
>> + if (!failover)
>> + return;
>> +
>> + failover_dev = rcu_dereference(failover->failover_dev);
>> + finfo = netdev_priv(failover_dev);
>> +
>> + netif_device_detach(failover_dev);
>> +
>> + rtnl_lock();
>> +
>> + slave_dev = rtnl_dereference(finfo->primary_dev);
>> + if (slave_dev)
>> + failover_slave_unregister(slave_dev);
>> +
>> + slave_dev = rtnl_dereference(finfo->standby_dev);
>> + if (slave_dev)
>> + failover_slave_unregister(slave_dev);
>> +
>> + failover_unregister(failover);
>> +
>> + unregister_netdevice(failover_dev);
>> +
>> + rtnl_unlock();
>> +
>> + free_netdev(failover_dev);
>> +}
>> +EXPORT_SYMBOL_GPL(failover_destroy);
>> +
>> +static __init int
>> +failover_init(void)
>> +{
>> + register_netdevice_notifier(&failover_notifier);
>> +
>> + return 0;
>> +}
>> +module_init(failover_init);
>> +
>> +static __exit
>> +void failover_exit(void)
>> +{
>> + unregister_netdevice_notifier(&failover_notifier);
>> +}
>> +module_exit(failover_exit);
>> +
>> +MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> --
>> 2.14.3
^ permalink raw reply
* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Jonathan Corbet @ 2018-04-20 15:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S . Miller, netdev, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <1e7a78a6-27cd-9679-54d7-44d05484eda7@gmail.com>
On Thu, 19 Apr 2018 18:01:32 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> We can keep mmap() nice interface, granted we can add one hook like in following patch.
>
> David, do you think such patch would be acceptable by lkml and mm/fs maintainers ?
>
> Alternative would be implementing an ioctl() or getsockopt() operation,
> but it seems less natural...
So I have little standing here, but what the heck, not letting that bother
me has earned me a living for the last 20 years or so...:)
I think you should consider switching over to an interface where you
mmap() the region once, and use ioctl() to move the data into that region,
for a couple of reasons beyond the locking issues you've already found:
- The "mmap() consumes data" semantics are a bit ... strange, IMO.
That's not what mmap() normally does. People expect ioctl() to do
magic things, instead.
- I would expect it to be a tiny bit faster, since you wouldn't be doing
the VMA setup and teardown each time.
Thanks,
jon
^ permalink raw reply
* Re: [PATCH net-next 0/4] geneve: verify user specified MTU or adjust with a lower device
From: David Miller @ 2018-04-20 15:19 UTC (permalink / raw)
To: alexey.kodanev; +Cc: netdev
In-Reply-To: <1524141752-25789-1-git-send-email-alexey.kodanev@oracle.com>
From: Alexey Kodanev <alexey.kodanev@oracle.com>
Date: Thu, 19 Apr 2018 15:42:28 +0300
> The first two patches don't introduce any functional changes and
> contain minor cleanups for code readability.
>
> The last one adds a new function geneve_link_config() similar to the
> other tunnels. The function will be used on a new link creation or
> when 'remote' parameter is changed. It adjusts a user specified MTU
> or, if it finds a lower device, tunes the tunnel MTU using it.
Ok, series applied, thanks.
^ permalink raw reply
* [PATCH v3 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:18 UTC (permalink / raw)
To: eric.dumazet, alexei.starovoitov, songliubraving, davem
Cc: netdev, linux-kernel, Yafang Shao
tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.
When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.
Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.
Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
v2 -> v3: use sock_gen_cookie in tcp_event_sk as well.
Maybe we could init sk_cookie in the stack then in other code
path we just read it other than set it. If that is needed I
will do it in another new patch.
v1 -> v2: use sk_cookie as a key suggested by Eric.
---
include/trace/events/tcp.h | 30 +++++++++++++++++++++---------
net/ipv4/tcp_input.c | 2 ++
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3dd6802..c1a5284 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -10,6 +10,7 @@
#include <linux/tracepoint.h>
#include <net/ipv6.h>
#include <net/tcp.h>
+#include <linux/sock_diag.h>
#define TP_STORE_V4MAPPED(__entry, saddr, daddr) \
do { \
@@ -113,7 +114,7 @@
*/
DECLARE_EVENT_CLASS(tcp_event_sk,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk),
@@ -125,6 +126,7 @@
__array(__u8, daddr, 4)
__array(__u8, saddr_v6, 16)
__array(__u8, daddr_v6, 16)
+ __field(__u64, sock_cookie)
),
TP_fast_assign(
@@ -144,24 +146,34 @@
TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+ __entry->sock_cookie = sock_gen_cookie(sk);
),
- TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c",
+ TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llx",
__entry->sport, __entry->dport,
__entry->saddr, __entry->daddr,
- __entry->saddr_v6, __entry->daddr_v6)
+ __entry->saddr_v6, __entry->daddr_v6,
+ __entry->sock_cookie)
);
DEFINE_EVENT(tcp_event_sk, tcp_receive_reset,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk)
);
DEFINE_EVENT(tcp_event_sk, tcp_destroy_sock,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
+
+ TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
+
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk)
);
@@ -232,6 +244,7 @@
__field(__u32, snd_wnd)
__field(__u32, srtt)
__field(__u32, rcv_wnd)
+ __field(__u64, sock_cookie)
),
TP_fast_assign(
@@ -256,15 +269,14 @@
__entry->rcv_wnd = tp->rcv_wnd;
__entry->ssthresh = tcp_current_ssthresh(sk);
__entry->srtt = tp->srtt_us >> 3;
+ __entry->sock_cookie = sock_gen_cookie(sk);
),
- TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
- "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
- "rcv_wnd=%u",
+ TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u sock_cookie=%llx",
__entry->saddr, __entry->daddr, __entry->mark,
__entry->length, __entry->snd_nxt, __entry->snd_una,
__entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
- __entry->srtt, __entry->rcv_wnd)
+ __entry->srtt, __entry->rcv_wnd, __entry->sock_cookie)
);
#endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f..43ad468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,6 +582,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
u32 copied;
int time;
+ trace_tcp_rcv_space_adjust(sk);
+
tcp_mstamp_refresh(tp);
time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next v3] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-20 15:14 UTC (permalink / raw)
To: eric.dumazet, alexei.starovoitov, songliubraving, davem
Cc: netdev, linux-kernel, Yafang Shao
tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.
When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.
Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.
Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
v2 -> v3: use sock_gen_cookie in tcp_event_sk as well.
Maybe we could init sk_cookie in the stack then in other code
path we just read it other than set it. If that is needed I
will do it in another new patch.
v1 -> v2: use sk_cookie as a key suggested by Eric.
---
include/trace/events/tcp.h | 30 +++++++++++++++++++++---------
net/ipv4/tcp_input.c | 2 ++
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3dd6802..c820bf6 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -10,6 +10,7 @@
#include <linux/tracepoint.h>
#include <net/ipv6.h>
#include <net/tcp.h>
+#include <linux/sock_diag.h>
#define TP_STORE_V4MAPPED(__entry, saddr, daddr) \
do { \
@@ -113,7 +114,7 @@
*/
DECLARE_EVENT_CLASS(tcp_event_sk,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk),
@@ -125,6 +126,7 @@
__array(__u8, daddr, 4)
__array(__u8, saddr_v6, 16)
__array(__u8, daddr_v6, 16)
+ __field(__u64, sock_cookie)
),
TP_fast_assign(
@@ -144,24 +146,34 @@
TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+ __entry->sock_cookie = sock_gen_cookie(sk);
),
- TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c",
+ TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llu",
__entry->sport, __entry->dport,
__entry->saddr, __entry->daddr,
- __entry->saddr_v6, __entry->daddr_v6)
+ __entry->saddr_v6, __entry->daddr_v6,
+ __entry->sock_cookie)
);
DEFINE_EVENT(tcp_event_sk, tcp_receive_reset,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk)
);
DEFINE_EVENT(tcp_event_sk, tcp_destroy_sock,
- TP_PROTO(const struct sock *sk),
+ TP_PROTO(struct sock *sk),
+
+ TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
+
+ TP_PROTO(struct sock *sk),
TP_ARGS(sk)
);
@@ -232,6 +244,7 @@
__field(__u32, snd_wnd)
__field(__u32, srtt)
__field(__u32, rcv_wnd)
+ __field(__u64, sock_cookie)
),
TP_fast_assign(
@@ -256,15 +269,14 @@
__entry->rcv_wnd = tp->rcv_wnd;
__entry->ssthresh = tcp_current_ssthresh(sk);
__entry->srtt = tp->srtt_us >> 3;
+ __entry->sock_cookie = sock_gen_cookie(sk);
),
- TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
- "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
- "rcv_wnd=%u",
+ TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u sock_cookie=%llx",
__entry->saddr, __entry->daddr, __entry->mark,
__entry->length, __entry->snd_nxt, __entry->snd_una,
__entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
- __entry->srtt, __entry->rcv_wnd)
+ __entry->srtt, __entry->rcv_wnd, __entry->sock_cookie)
);
#endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f..43ad468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,6 +582,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
u32 copied;
int time;
+ trace_tcp_rcv_space_adjust(sk);
+
tcp_mstamp_refresh(tp);
time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH net-next v3] team: account for oper state
From: David Miller @ 2018-04-20 15:13 UTC (permalink / raw)
To: gwilkie; +Cc: jiri, netdev
In-Reply-To: <20180419103414.542-1-gwilkie@vyatta.att-mail.com>
From: George Wilkie <gwilkie@vyatta.att-mail.com>
Date: Thu, 19 Apr 2018 11:34:14 +0100
> Account for operational state when determining port linkup state,
> as per Documentation/networking/operstates.txt.
>
> Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
Applied.
^ permalink raw reply
* Re: [net-next 0/3] tipc: Confgiuration of MTU for media UDP
From: David Miller @ 2018-04-20 15:04 UTC (permalink / raw)
To: mohan.krishna.ghanta.krishnamurthy
Cc: tipc-discussion, jon.maloy, maloy, ying.xue, netdev
In-Reply-To: <1524128780-2550-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>
From: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Date: Thu, 19 Apr 2018 11:06:17 +0200
> Systematic measurements have shown that an emulated MTU of 14k for
> UDP bearers is the optimal value for maximal throughput. Accordingly,
> the default MTU of UDP bearers is changed to 14k.
>
> We also provide users with a fallback option from this value,
> by providing support to configure MTU for UDP bearers. The following
> options are introduced which are symmetrical to the design of
> confguring link tolerance.
>
> - Configure media with new MTU value, which will take effect on
> links going up after the moment it was configured. Alternatively,
> the bearer has to be disabled and re-enabled, for existing links to
> reflect the configured value.
>
> - Configure bearer with new MTU value, which take effect on
> running links dynamically.
>
> Please note:
> - User has to change MTU at both endpoints, otherwise the link
> will fall back to smallest MTU after a reset.
> - Failover from a link with higher MTU to a link with lower MTU
There are many negatives to using UDP in a way which causes
fragmentation, like this code now does.
But whatever, you guys can do whatever you want and get to keep the
pieces I guess :-)
Series applied.
^ permalink raw reply
* Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Alexander Duyck @ 2018-04-20 14:56 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180420030640-mutt-send-email-mst@kernel.org>
On Thu, Apr 19, 2018 at 5:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 03, 2018 at 12:06:03PM -0700, Alexander Duyck wrote:
>> On Tue, Apr 3, 2018 at 11:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 03, 2018 at 10:32:00AM -0700, Alexander Duyck wrote:
>> >> On Tue, Apr 3, 2018 at 6:12 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Fri, Mar 16, 2018 at 09:40:34AM -0700, Alexander Duyck wrote:
>> >> >> On Fri, Mar 16, 2018 at 9:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
>> >> >> >> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> >> >>
>> >> >> >> Hardware-realized virtio_pci devices can implement SR-IOV, so this
>> >> >> >> patch enables its use. The device in question is an upcoming Intel
>> >> >> >> NIC that implements both a virtio_net PF and virtio_net VFs. These
>> >> >> >> are hardware realizations of what has been up to now been a software
>> >> >> >> interface.
>> >> >> >>
>> >> >> >> The device in question has the following 4-part PCI IDs:
>> >> >> >>
>> >> >> >> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
>> >> >> >> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
>> >> >> >>
>> >> >> >> The patch currently needs no check for device ID, because the callback
>> >> >> >> will never be made for devices that do not assert the capability or
>> >> >> >> when run on a platform incapable of SR-IOV.
>> >> >> >>
>> >> >> >> One reason for this patch is because the hardware requires the
>> >> >> >> vendor ID of a VF to be the same as the vendor ID of the PF that
>> >> >> >> created it. So it seemed logical to simply have a fully-functioning
>> >> >> >> virtio_net PF create the VFs. This patch makes that possible.
>> >> >> >>
>> >> >> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> >> >> >> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
>> >> >> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> >> >
>> >> >> > So if and when virtio PFs can manage the VFs, then we can
>> >> >> > add a feature bit for that?
>> >> >> > Seems reasonable.
>> >> >>
>> >> >> Yes. If nothing else you may not even need a feature bit depending on
>> >> >> how things go.
>> >> >
>> >> > OTOH if the interface is changed in an incompatible way,
>> >> > and old Linux will attempt to drive the new device
>> >> > since there is no check.
>> >> >
>> >> > I think we should add a feature bit right away.
>> >>
>> >> I'm not sure why you would need a feature bit. The capability is
>> >> controlled via PCI configuration space. If it is present the device
>> >> has the capability. If it is not then it does not.
>> >>
>> >> Basically if the PCI configuration space is not present then the sysfs
>> >> entries will not be spawned and nothing will attempt to use this
>> >> function.
>> >>
>> >> - ALex
>> >
>> > It's about compability with older guests which ignore the
>> > capability.
>> >
>> > The feature is thus helpful so host knows whether guest supports VFs.
>>
>> The thing is if the capability is ignored then the feature isn't used.
>> So for SR-IOV it isn't an uncommon thing for there to be drivers for
>> the PF floating around that do not support SR-IOV. In such cases
>> SR-IOV just isn't used while the hardware could support it.
>
> Right but how come there are VF drivers but PF driver does not
> know about these?
I'm not sure what you mean here. The VF and PF drivers are the same
driver. The only difference is that the PF has the extra SR-IOV
configuration space.
What this code is meant to enable is a form of SR-IOV where the VFs
are essentially pre-allocated resources. So for example in our case
the MMIO space is identical for a PF versus any of the VFs. It doesn't
have any special controls in place to allow the PF to manipulate any
of the resources belonging to the VFs.
> And are there PF drivers that intentially do not enable SRIOV
> because it's known to be broken in some way?
In the Virtio IO case right now are there any devices that support
SR-IOV? For now this is just an add-on bit to a function that is
already emulating the Virtio in hardware.
> Case in point I do think virtio want to limit this
> depending on a feature bit on general principles
> (the principle being that all extensions have feature bits).
This part has me kind of scratching my head. In our setup the "PF" is
really nothing more than a "VF" with the SR-IOV configuration space
attached to it. There are already examples of similar designs for NVMe
and the Amazon ENA devices. Giving the "PF" any functionality in MMIO
space that controls the SR-IOV kind of defeats the whole point of
allowing this function in the first place. Basically the PF isn't
really controlling things, it is the kernel that is doing it.
> There are security implications here - we previously relied on
> whitelisting after all.
Yes and no. The original patch set had issues as you could have a PF
assigned to user space and the VFs managed by the host. When I changed
things so that the function had to be in a kernel driver that issue
went away.
> Wouldn't it be safer to be a bit more careful and update the
> actual PF drivers? It's just one line per driver, but it
> can be done with an ack by driver maintainer.
> If/once we find out all drivers do have it, we can then
> change the default.
I have no clue what you are talking about here. This is the more
careful approach. Are you sure you are reviewing the v7 of the
patches?
My understanding is that no paravirtual interfaces currently expose
SR-IOV. What we are looking at is hardware will want to emulate
Virtio, specifically virtio_net in the future and as a part of that
the PF ends up emulating it as well. What we would need to watch for
going forward is that any device that enables SR-IOV support would
need to also provide a 4 tuple ID so that if something goes wrong with
it we could disable SR-IOV on the device via a PCI quirk later.
>> I would think in the case of virtio it would be the same kind of
>> thing. Basically if SR-IOV is supported by the host then the
>> capability would be present. If SR-IOV is supported by the guest then
>> it would make use of the capability to spawn VFs. If either the
>> capability isn't present, or the driver doesn't use it then you won't
>> be able to spawn VFs in the guest.
>
>> Maybe I am missing something. Do you support dynamically changing the
>> PCI configuration space for Virtio devices based on the presence of
>> feature bits provided by the guest?
>
> No. The point is that IMHO at least virtio - in absence of feature bit -
> to ignore VFs rather than assume they are safe to drive
> in an unmanaged way.
>
>> Also are you saying this patch set should wait on the feature bit to
>> be added, or are you talking about doing this as some sort of
>> follow-up?
>>
>> - Alex
>
> I think for virtio it should include the feature bit, yes.
> Adding feature bit is very easy - post a patch to the virtio TC mailing
> list, wait about a week to give people time to respond (two weeks if it
> is around holidays and such).
The problem is we are talking about hardware/FPGA, not software.
Adding a feature bit means going back and updating RTL. The software
side of things is easy, re-validating things after a hardware/FPGA
change not so much.
If this is a hard requirement I may just drop the virtio patch, push
what I have, and leave it to Mark/Dan to deal with the necessary RTL
and code changes needed to support Virtio as I don't expect the
turnaround to be as easy as just a patch.
Thanks.
- Alex
^ permalink raw reply
* Re: [PATCH net-next] liquidio: Added ndo_get_vf_stats support
From: David Miller @ 2018-04-20 14:54 UTC (permalink / raw)
To: felix.manlunas
Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
intiyaz.basha
In-Reply-To: <20180419061828.GA4292@felix-thinkpad.cavium.com>
From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Wed, 18 Apr 2018 23:18:28 -0700
> From: Intiyaz Basha <intiyaz.basha@cavium.com>
>
> Added the ndo to gather VF statistics through the PF.
>
> Collect VF statistics via mailbox from VF.
>
> Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Applied, thank you.
^ permalink raw reply
* Re: [PATCH 1/8] arm: shmobile: Change platform dependency to ARCH_RENESAS
From: Sergei Shtylyov @ 2018-04-20 14:53 UTC (permalink / raw)
To: Geert Uytterhoeven, Simon Horman, Magnus Damm, Russell King,
Catalin Marinas, Will Deacon, Dan Williams, Vinod Koul,
Mauro Carvalho Chehab, David S . Miller, Greg Kroah-Hartman,
Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
Arnd Bergmann, Kuninori Morimoto, Laurent Pinchart
Cc: linux-renesas-soc, linux-arm-kernel, dmaengine, linux-media,
netdev, devel, alsa-devel, linux-kernel
In-Reply-To: <1524230914-10175-2-git-send-email-geert+renesas@glider.be>
On 04/20/2018 04:28 PM, Geert Uytterhoeven wrote:
> Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS")
> is ARCH_RENESAS a more appropriate platform dependency than the legacy
"ARCH_RENESAS is", no?
> ARCH_SHMOBILE, hence use the former.
>
> This will allow to drop ARCH_SHMOBILE on ARM in the near future.
>
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
[...]
MBR, Sergei
^ permalink raw reply
* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-20 14:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Dave Young, netdev@vger.kernel.org, kexec@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Indranil Choudhury, Nirranjan Kirubaharan,
stephen@networkplumber.org, Ganesh GR, akpm@linux-foundation.org,
torvalds@linux-foundation.org, davem@davemloft.net,
viro@zeniv.linux.org.uk
In-Reply-To: <87po2uhueu.fsf@xmission.com>
On Friday, April 04/20/18, 2018 at 19:06:09 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
>
> > On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
> >> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> >>
> >> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
> >> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> >> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> >> > > Hi Rahul,
> >> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> >> > > > On production servers running variety of workloads over time, kernel
> >> >> > > > panic can happen sporadically after days or even months. It is
> >> >> > > > important to collect as much debug logs as possible to root cause
> >> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> >> > > > underlying hardware/firmware state (like register dump, firmware
> >> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> >> > > > helpful while debugging the culprit device driver.
> >> >> > > >
> >> >> > > > This series of patches add new generic framework that enable device
> >> >> > > > drivers to collect device specific snapshot of the hardware/firmware
> >> >> > > > state of the underlying device in the crash recovery kernel. In crash
> >> >> > > > recovery kernel, the collected logs are added as elf notes to
> >> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> >> > > >
> >> >> > > > The sequence of actions done by device drivers to append their device
> >> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> >> > > >
> >> >> > > > 1. During probe (before hardware is initialized), device drivers
> >> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
> >> >> > > > callback function, along with buffer size and log name needed for
> >> >> > > > firmware/hardware log collection.
> >> >> > >
> >> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
> >> >> > > phase. But I did not read the old comment, not sure if it has been discussed
> >> >> > > or not.
> >> >> > >
> >> >> >
> >> >> > We must not collect dumps in crashing kernel. Adding more things in
> >> >> > crash dump path risks not collecting vmcore at all. Eric had
> >> >> > discussed this in more detail at:
> >> >> >
> >> >> > https://lkml.org/lkml/2018/3/24/319
> >> >> >
> >> >> > We are safe to collect dumps in the second kernel. Each device dump
> >> >> > will be exported as an elf note in /proc/vmcore.
> >> >>
> >> >> I understand that we should avoid adding anything in crash path. And I also
> >> >> agree to collect device dump in second kernel. I just assumed device
> >> >> dump use some memory area to store the debug info and the memory
> >> >> is persistent so that this can be done in 2 steps, first register the
> >> >> address in elf header in kexec_load, then collect the dump in 2nd
> >> >> kernel. But it seems the driver is doing some other logic to collect
> >> >> the info instead of just that simple like I thought.
> >> >>
> >> >
> >> > It seems simpler, but I'm concerned with waste of memory area, if
> >> > there are no device dumps being collected in second kernel. In
> >> > approach proposed in these series, we dynamically allocate memory
> >> > for the device dumps from second kernel's available memory.
> >>
> >> Don't count that kernel having more than about 128MiB.
> >>
> >
> > If large dump is expected, Administrator can increase the memory
> > allocated to the second kernel (using crashkernel boot param), to
> > ensure device dumps get collected.
>
> Except 128MiB is already a already a huge amount to reserve. I
> typically have run crash dumps with 16MiB of memory and thought it was
> overkill. Looking below 32MiB seems a bit high but it is small enough
> that it is still doable. I am baffled at how 2GiB can be guaranteed to fit
> in 32MiB (sparse register space?) but if it works reliably.
>
Yes, we skip portions in on-chip memory dump that do not add to debug
value (such as the large regions reserved for holding Payload data
going through the device) and hence the overall dump size reduces
significantly.
> >> For that reason if for no other it would be nice if it was possible to
> >> have the driver to not initialize the device and just stand there
> >> handing out the data a piece at a time as it is read from /proc/vmcore.
> >>
> >
> > Since cxgb4 is a network driver, it can be used to transfer the dumps
> > over the network. So we must ensure the dumps get collected and
> > stored, before device gets initialized to transfer dumps over
> > the network.
>
> Good point. For some reason I was thinking it was an infiniband and not
> an 10GiB ethernet device.
>
> >> The 2GiB number I read earlier concerns me for working in a limited
> >> environment.
> >>
> >
> > All dumps, including the 2GB on-chip memory dump, is compressed by
> > the cxgb4 driver as they are collected. The overall compressed dump
> > comes out at max 32 MB.
> >
> >> It might even make sense to separate this into a completely separate
> >> module (depended upon the main driver if it makes sense to share
> >> the functionality) so that people performing crash dumps would not
> >> hesitate to include the code in their initramfs images.
> >>
> >> I can see splitting a device up into a portion only to be used in case
> >> of a crash dump and a normal portion like we do for main memory but I
> >> doubt that makes sense in practice.
> >>
> >
> > This is not required, especially in case of network drivers, which
> > must collect underlying device dump and initialize the device to
> > transfer dumps over the network.
>
> I have a practical concern. What happens if the previous kernel left
> the device in such a bad stat the driver can not successfully initialize
> it.
>
> Does failure to initialize cxgb4 after a crash now mean that you can not
> capture the crash dump to see the crazy state the device was in?
>
> Typically the initramfs for a crash dump does not include unnecessary
> drivers so that hardware in states the drivers can't handle won't
> prevent taking a crash dump.
>
> I understand the issue if you are taking a dump over your 10GiB ethernet
> it is a moot point. But if you are writing your dump to disk, or
> writing it over a management gigabit ethernet then it is still an issue.
>
> Is there a decoupling so that a totally b0rked device can't prevent
> taking it's own dump?
>
As long as we can safely map and access the BAR registers, we
can collect the dumps regardless of whatever state the device and
firmware were left in and store it as part of /proc/vmcore. After
that, we attempt to re-initialize the device and restart the
firmware. So, even if driver initialization fails at this point,
we still have the dumps as part of vmcore.
Thanks,
Rahul
^ permalink raw reply
* Re: [PATCH net-next v2 0/3] ave: fix the activation issues for some UniPhier SoCs
From: David Miller @ 2018-04-20 14:50 UTC (permalink / raw)
To: hayashi.kunihiko
Cc: netdev, andrew, f.fainelli, robh+dt, mark.rutland,
linux-arm-kernel, linux-kernel, devicetree, yamada.masahiro,
masami.hiramatsu, jaswinder.singh
In-Reply-To: <1524122695-19597-1-git-send-email-hayashi.kunihiko@socionext.com>
From: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Date: Thu, 19 Apr 2018 16:24:52 +0900
> This add the following stuffs to fix the activation issues and satisfy
> requirements for AVE ethernet driver implemented on some UniPhier SoCs.
>
> - Add support for additional necessary clocks and resets, because the kernel
> is stalled on Pro4 due to lack of them.
>
> - Check whether the SoC supports the specified phy-mode
>
> - Add DT property support indicating system controller that has the feature
> for configurating phy-mode including built-in phy on LD11.
>
> v1: https://www.spinics.net/lists/netdev/msg494904.html
>
> Changes since v1:
> - Add 'Reviewed-by' lines
Series applied to net-next, thank you.
^ permalink raw reply
* Re: [PATCH] [net] ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
From: Ahmed Abdelsalam @ 2018-04-20 14:46 UTC (permalink / raw)
To: David Lebrun; +Cc: davem, dlebrun, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <3627f25a-0c47-6428-aa76-5baf96993a4c@gmail.com>
On Fri, 20 Apr 2018 15:38:08 +0100
David Lebrun <dav.lebrun@gmail.com> wrote:
> On 04/20/2018 02:58 PM, Ahmed Abdelsalam wrote:
> > In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
> > in order to set the src addr of outer IPv6 header.
> >
> > The net_device is required for set_tun_src(). However calling ip6_dst_idev()
> > on dst_entry in case of IPv4 traffic results on the following bug.
> >
> > Using just dst->dev should fix this BUG.
> >
>
> Good catch, thanks for spotting this. If you actually tested your fix
> with IPv4 and IPv6 traffic, you should mention it in the commit message.
> Your current formulation suggests that you just guessed a fix without
> testing.
>
Yes, I did two tests for both IPv4 and IPv6.
Sorry for this Language Bug.
> >
> > Fixes: 8936ef7604c11 ipv6: sr: fix NULL pointer dereference when setting encap source address
> > Signed-off-by: Ahmed Abdelsalam<amsalam20@gmail.com>
>
> Acked-by: David Lebrun <dlebrun@google.com>
--
Ahmed Abdelsalam <amsalam20@gmail.com>
^ permalink raw reply
* Re: [PATCH] net: qrtr: Expose tunneling endpoint to user space
From: David Miller @ 2018-04-20 14:40 UTC (permalink / raw)
To: bjorn.andersson; +Cc: linux-kernel, netdev, linux-arm-msm, clew
In-Reply-To: <20180419050346.17054-1-bjorn.andersson@linaro.org>
From: Bjorn Andersson <bjorn.andersson@linaro.org>
Date: Wed, 18 Apr 2018 22:03:46 -0700
> +struct qrtr_tun {
> + struct qrtr_endpoint ep;
> +
> + struct mutex queue_lock;
> + struct sk_buff_head queue;
> + wait_queue_head_t readq;
> +};
The queue lock is surperfluous. sk_buff_head and all of the helpers you
are using does it's own locking. So you are essentially using two sets
of locks to protect the same object.
^ permalink raw reply
* Re: [PATCH] [net] ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
From: David Lebrun @ 2018-04-20 14:38 UTC (permalink / raw)
To: Ahmed Abdelsalam, davem, dlebrun, kuznet, yoshfuji, netdev,
linux-kernel
In-Reply-To: <1524232685-1203-1-git-send-email-amsalam20@gmail.com>
On 04/20/2018 02:58 PM, Ahmed Abdelsalam wrote:
> In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
> in order to set the src addr of outer IPv6 header.
>
> The net_device is required for set_tun_src(). However calling ip6_dst_idev()
> on dst_entry in case of IPv4 traffic results on the following bug.
>
> Using just dst->dev should fix this BUG.
>
Good catch, thanks for spotting this. If you actually tested your fix
with IPv4 and IPv6 traffic, you should mention it in the commit message.
Your current formulation suggests that you just guessed a fix without
testing.
>
> Fixes: 8936ef7604c11 ipv6: sr: fix NULL pointer dereference when setting encap source address
> Signed-off-by: Ahmed Abdelsalam<amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox