Netdev List


* Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob
From: 严海双 @ 2017-09-26  1:25 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, edumazet, weiwan, lucab, netdev, linux-kernel
In-Reply-To: <20170925.162445.234890912211240693.davem@davemloft.net>



> On Sep 26, 2017, at 7:24 AM, David Miller <davem@davemloft.net> wrote:
> 
> From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> Date: Fri, 22 Sep 2017 21:48:43 +0800
> 
>> @@ -9,13 +9,18 @@
>> #include <net/inetpeer.h>
>> #include <net/tcp.h>
>> 
>> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
>> -
>> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
>> -
>> -void tcp_fastopen_init_key_once(bool publish)
>> +void tcp_fastopen_init_key_once(struct net *net)
> 
> Why did you remove the 'publish' logic from this function?
> 

I think this logic is no longer necessary. In proc_tcp_fastopen_key, I have removed
the tcp_fastopen_init_key_once(false) call, which was the one that passed 'publish' as false:

-		/* Generate a dummy secret but don't publish it. This
-		 * is needed so we don't regenerate a new key on the
-		 * first invocation of tcp_fastopen_cookie_gen
-		 */
-		tcp_fastopen_init_key_once(false);
-		tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
+		tcp_fastopen_reset_cipher(net, user_key, TCP_FASTOPEN_KEY_LENGTH);

The comment said we don't regenerate a new key on the first invocation of tcp_fastopen_cookie_gen,
but tcp_fastopen_cookie_gen has not called tcp_fastopen_init_key_once since
commit dfea2aa654243 (tcp: Do not call tcp_fastopen_reset_cipher from interrupt context).

And in the other places that call tcp_fastopen_init_key_once, 'publish' is always true:

--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -222,7 +222,7 @@ int inet_listen(struct socket *sock, int backlog)
		    (tcp_fastopen & TFO_SERVER_ENABLE) &&
		    !inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) {
			fastopen_queue_tune(sk, backlog);
-			tcp_fastopen_init_key_once(true);
+			tcp_fastopen_init_key_once(sock_net(sk));
		}

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2749,7 +2749,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
	case TCP_FASTOPEN:
		if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
		    TCPF_LISTEN))) {
-			tcp_fastopen_init_key_once(true);
+			tcp_fastopen_init_key_once(net);

			fastopen_queue_tune(sk, val);
		} else {


So I deleted the 'publish' logic, so that the function always behaves as if 'publish' were true.


* Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()
From: Wei Wang @ 2017-09-26  1:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: David Miller, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <20170926005618.z4ymlry7jl47nkjm@kafai-mbp.dhcp.thefacebook.com>

On Mon, Sep 25, 2017 at 5:56 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Mon, Sep 25, 2017 at 05:35:22PM +0000, Wei Wang wrote:
>> From: Wei Wang <weiwan@google.com>
>>
>> fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
>> make sense because it is possible that the route passed in is already
>> deleted by some other thread and rt->dst.obsolete is set to
>> DST_OBSOLETE_DEAD.
>> So this commit deletes this WARN_ON() and also removes the
>> "#if RT6_DEBUG >= 2" condition so that if the route is already
>> obsolete, we return right at the beginning of fib6_del().
>>
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index e5308d7cbd75..693bcd7ef6d2 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
>>       struct net *net = info->nl_net;
>>       struct rt6_info **rtp;
>>
>> -#if RT6_DEBUG >= 2
>> -     if (rt->dst.obsolete > 0) {
>> -             WARN_ON(fn);
> fn should have already been set to NULL if it is removed
> from the fib6 tree?
>

That is true. rt->rt6i_node (fn) should already be marked as NULL.
That means the check on rt->dst.obsolete is redundant.
I will remove it in v2.
Thanks Martin.
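
The point about fn can be illustrated with a small user-space sketch. It assumes, as Martin suggests, that unlinking a route always clears its node pointer at the same moment it marks the route dead; the types and names below (sketch_route, sketch_unlink, sketch_del, SKETCH_OBSOLETE_DEAD) are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_OBSOLETE_DEAD 2	/* placeholder standing in for DST_OBSOLETE_DEAD */

struct sketch_node { int unused; };

struct sketch_route {
	struct sketch_node *node;	/* plays the role of rt->rt6i_node (fn) */
	int obsolete;			/* plays the role of rt->dst.obsolete */
};

/* Assumption: unlinking clears the node pointer and marks the route dead together. */
static void sketch_unlink(struct sketch_route *rt)
{
	rt->node = NULL;
	rt->obsolete = SKETCH_OBSOLETE_DEAD;
}

/* Under that assumption, checking the node pointer alone catches a dead route. */
int sketch_del(struct sketch_route *rt)
{
	assert(rt);
	if (!rt->node)
		return -ENOENT;
	sketch_unlink(rt);
	return 0;
}
```

Calling sketch_del() twice on the same route succeeds the first time and returns -ENOENT the second time, without ever looking at the obsolete field.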


>> -             return -ENOENT;
>> -     }
>> -#endif
>> -     if (!fn || rt == net->ipv6.ip6_null_entry)
>> +     if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
>>               return -ENOENT;
>>
>>       WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>> --
>> 2.14.1.821.g8fa685d3b7-goog
>>


* Re: [PATCH] r8152: add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-26  1:12 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Oliver Neukum, Hayes Wang, David S . Miller, LKML, linux-usb,
	netdev
In-Reply-To: <CANEJEGtM+gFC9Ofmp=UmGn5pKys9NRbsC6+ks_VqaLKWkEBS8A@mail.gmail.com>

On Mon, Sep 25, 2017 at 1:17 PM, Grant Grundler <grundler@chromium.org> wrote:
...
> I didn't realize cdc_ether has a blacklist to make sure
> RTL8152|RTL8153 devices are not picked up by cdc_ether. Would you
> prefer I add this device to the blacklist in the same patch?

I've sent a V2 which also updates the blacklist in cdc_ether.

cheers,
grant


* [PATCH V2] r8152:  add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-26  1:09 UTC (permalink / raw)
  To: Hayes Wang; +Cc: linux-usb, David S . Miller, LKML, netdev, Grant Grundler

This Linksys dongle comes up in cdc_ether mode by default.
This patch allows r8152 to claim the device:
   Bus 002 Device 002: ID 13b1:0041 Linksys

Signed-off-by: Grant Grundler <grundler@chromium.org>
---
 drivers/net/usb/cdc_ether.c | 8 ++++++++
 drivers/net/usb/r8152.c     | 2 ++
 2 files changed, 10 insertions(+)

V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist
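
A rough user-space sketch of how such a blacklist entry takes effect. It assumes that the first matching table entry wins and that an entry with zero driver_info makes the driver decline the device; the struct, table, and function names are hypothetical, and the catch-all entry is a simplification of the generic class match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a usb_device_id table entry. */
struct sketch_usb_id {
	unsigned short vendor;
	unsigned short product;
	int match_any;			/* non-zero: generic class match (catch-all) */
	const char *driver_info;	/* NULL plays the role of ".driver_info = 0" */
};

static const struct sketch_usb_id sketch_cdc_products[] = {
	/* Specific entry listed first so it shadows the generic match below. */
	{ 0x13b1, 0x0041, 0, NULL },
	{ 0, 0, 1, "generic CDC Ethernet" },
};

/* Returns the driver info of the first matching entry; NULL means "decline". */
const char *sketch_cdc_claim(unsigned short vendor, unsigned short product)
{
	size_t n = sizeof(sketch_cdc_products) / sizeof(sketch_cdc_products[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		const struct sketch_usb_id *id = &sketch_cdc_products[i];

		if (id->match_any ||
		    (id->vendor == vendor && id->product == product))
			return id->driver_info;
	}
	return NULL;
}
```

In this model sketch_cdc_claim(0x13b1, 0x0041) returns NULL, leaving the device free for another driver, while any other ID falls through to the generic entry.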

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8ab281b478f2..fa5c2e7aff1a 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -546,6 +546,7 @@ static const struct driver_info wwan_info = {
 #define DELL_VENDOR_ID		0x413C
 #define REALTEK_VENDOR_ID	0x0bda
 #define SAMSUNG_VENDOR_ID	0x04e8
+#define LINKSYS_VENDOR_ID	0x13b1
 #define LENOVO_VENDOR_ID	0x17ef
 #define NVIDIA_VENDOR_ID	0x0955
 #define HP_VENDOR_ID		0x03f0
@@ -737,6 +738,13 @@ static const struct usb_device_id	products[] = {
 	.driver_info = 0,
 },
 
+/* Linksys USB3GIGV1 Ethernet Adapter */
+{
+	USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, USB_CLASS_COMM,
+			USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+	.driver_info = 0,
+},
+
 /* ThinkPad USB-C Dock (based on Realtek RTL8153) */
 {
 	USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3062, USB_CLASS_COMM,
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ceb78e2ea4f0..941ece08ba78 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -613,6 +613,7 @@ enum rtl8152_flags {
 #define VENDOR_ID_MICROSOFT		0x045e
 #define VENDOR_ID_SAMSUNG		0x04e8
 #define VENDOR_ID_LENOVO		0x17ef
+#define VENDOR_ID_LINKSYS		0x13b1
 #define VENDOR_ID_NVIDIA		0x0955
 
 #define MCU_TYPE_PLA			0x0100
@@ -5316,6 +5317,7 @@ static const struct usb_device_id rtl8152_table[] = {
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x720c)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7214)},
+	{REALTEK_USB_DEVICE(VENDOR_ID_LINKSYS, 0x0041)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_NVIDIA,  0x09ff)},
 	{}
 };
-- 
2.14.1.821.g8fa685d3b7-goog


* Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()
From: Martin KaFai Lau @ 2017-09-26  0:56 UTC (permalink / raw)
  To: Wei Wang; +Cc: David Miller, netdev, Eric Dumazet
In-Reply-To: <20170925173522.99892-1-tracywwnj@gmail.com>

On Mon, Sep 25, 2017 at 05:35:22PM +0000, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
> 
> fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
> make sense because it is possible that the route passed in is already
> deleted by some other thread and rt->dst.obsolete is set to
> DST_OBSOLETE_DEAD.
> So this commit deletes this WARN_ON() and also removes the
> "#if RT6_DEBUG >= 2" condition so that if the route is already
> obsolete, we return right at the beginning of fib6_del().
> 
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index e5308d7cbd75..693bcd7ef6d2 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
>  	struct net *net = info->nl_net;
>  	struct rt6_info **rtp;
>  
> -#if RT6_DEBUG >= 2
> -	if (rt->dst.obsolete > 0) {
> -		WARN_ON(fn);
fn should have already been set to NULL if it is removed
from the fib6 tree?

> -		return -ENOENT;
> -	}
> -#endif
> -	if (!fn || rt == net->ipv6.ip6_null_entry)
> +	if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
>  		return -ENOENT;
>  
>  	WARN_ON(!(fn->fn_flags & RTN_RTINFO));
> -- 
> 2.14.1.821.g8fa685d3b7-goog
> 


* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 23:47 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <CAM_iQpUKURMfZOyTMbgVFG43rgUg8-3yAAVtEmPja5T6xkZpjA@mail.gmail.com>

On 09/26/2017 01:11 AM, Cong Wang wrote:
> On Mon, Sep 25, 2017 at 2:16 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 09/25/2017 07:13 PM, Cong Wang wrote:
[...]
>> There's this condition in the code before above idr allocations,
>> I think also in other classifiers:
>>
>>          if (oldprog) {
>>                  if (handle && oldprog->handle != handle) {
>>                          ret = -EINVAL;
>>                          goto errout;
>>                  }
>>          }
>
> Sure. If we use handle to find oldprog, it should have the
> same handle. cls_bpf_get() guarantees it. This check is
> redundant.

Good point, we should just test for 'oldprog && oldprog->handle !=
handle' and bail out then, otherwise it's just irritating. I can
see to fix this up later. Seems fine to me then, thanks!
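
The redundancy argument can be checked with a toy lookup. This sketch assumes that oldprog is only ever obtained by searching for the requested handle, as Cong describes; sketch_prog, sketch_get, and sketch_handle_mismatch are hypothetical names, not the cls_bpf functions.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_prog {
	unsigned int handle;
};

/* Toy lookup: returns the program with exactly this handle, or NULL. */
const struct sketch_prog *sketch_get(const struct sketch_prog *progs,
				     size_t n, unsigned int handle)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (progs[i].handle == handle)
			return &progs[i];
	return NULL;
}

/* The simplified test proposed in the reply above. */
int sketch_handle_mismatch(const struct sketch_prog *oldprog,
			   unsigned int handle)
{
	return oldprog && oldprog->handle != handle;
}
```

For any handle, sketch_handle_mismatch(sketch_get(progs, n, handle), handle) is 0, because the lookup either returns NULL or a program whose handle already equals the argument. The test can only fire if oldprog came from somewhere other than that lookup.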


* Re: [PATCH] [for 4.14] net: qcom/emac: specify the correct size when mapping a DMA buffer
From: David Miller @ 2017-09-25 23:35 UTC (permalink / raw)
  To: timur; +Cc: netdev, stable
In-Reply-To: <1506112364-26787-1-git-send-email-timur@codeaurora.org>

From: Timur Tabi <timur@codeaurora.org>
Date: Fri, 22 Sep 2017 15:32:44 -0500

> When mapping the RX DMA buffers, the driver was accidentally specifying
> zero for the buffer length.  Under normal circumstances, SWIOTLB does not
> need to allocate a bounce buffer, so the address is just mapped without
> checking the size field.  This is why the error was not detected earlier.
> 
> Fixes: b9b17debc69d ("net: emac: emac gigabit ethernet controller driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Timur Tabi <timur@codeaurora.org>

Applied and queued up for -stable, thanks.


* Re: [net-next] net: remove MTU limits for dummy and ifb device
From: David Miller @ 2017-09-25 23:26 UTC (permalink / raw)
  To: zhangshengju; +Cc: sd, edumazet, netdev
In-Reply-To: <1506095869-5465-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Fri, 22 Sep 2017 23:57:49 +0800

> These two drivers (dummy and ifb) call ether_setup(). After commit
> 61e84623ace3 ("net: centralize net_device min/max MTU checking"), the
> range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
> 
> These two devices should not have limits on MTU. This patch sets their
> min_mtu/max_mtu to 0, so that dev_set_mtu() will not check the mtu range
> and the mtu can be set to any value.
> 
> CC: Eric Dumazet <edumazet@google.com>
> CC: Sabrina Dubroca <sd@queasysnail.net>
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied.
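
A minimal sketch of the range check the commit message relies on. It assumes that a max_mtu of 0 is treated as "no upper bound", which is what the description implies; sketch_dev and sketch_set_mtu are hypothetical names, not the kernel's dev_set_mtu().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the MTU-related fields of a net_device. */
struct sketch_dev {
	unsigned int mtu;
	unsigned int min_mtu;
	unsigned int max_mtu;	/* assumption: 0 means "no upper limit" */
};

int sketch_set_mtu(struct sketch_dev *dev, int new_mtu)
{
	assert(dev);

	/* MTU must not be negative or below the device minimum. */
	if (new_mtu < 0 || (unsigned int)new_mtu < dev->min_mtu)
		return -EINVAL;

	/* The upper bound is only enforced when one is configured. */
	if (dev->max_mtu > 0 && (unsigned int)new_mtu > dev->max_mtu)
		return -EINVAL;

	dev->mtu = (unsigned int)new_mtu;
	return 0;
}
```

With the ether_setup() defaults of [68, 1500], a request for 9000 is rejected; with both limits set to 0, the same request is accepted.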


* Re: [PATCH] hv_netvsc: make const array ver_list static, reduces object code size
From: David Miller @ 2017-09-25 23:25 UTC (permalink / raw)
  To: colin.king
  Cc: kys, haiyangz, sthemmin, devel, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20170922155023.15821-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri, 22 Sep 2017 16:50:23 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Don't populate const array ver_list on the stack, instead make it
> static. Makes the object code smaller by over 400 bytes:
> 
> Before:
>    text	   data	    bss	    dec	    hex	filename
>   18444	   3168	    320	  21932	   55ac	drivers/net/hyperv/netvsc.o
> 
> After:
>    text	   data	    bss	    dec	    hex	filename
>   17950	   3224	    320	  21494	   53f6	drivers/net/hyperv/netvsc.o
> 
> (gcc 6.3.0, x86-64)
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thanks.
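
For illustration, the two forms side by side in a user-space sketch. The version numbers are invented and the function names are hypothetical; whether the automatic array is really rebuilt on the stack at each call depends on the compiler and optimization level, so the size difference quoted above should be taken from the commit, not from this example.

```c
#include <assert.h>
#include <stddef.h>

/* Automatic array: the compiler may emit code to populate it on every call. */
unsigned int sketch_pick_auto(unsigned int max_supported)
{
	const unsigned int ver_list[] = { 1, 2, 4, 5 };
	unsigned int best = 0;
	size_t i;

	for (i = 0; i < sizeof(ver_list) / sizeof(ver_list[0]); i++)
		if (ver_list[i] <= max_supported)
			best = ver_list[i];
	return best;
}

/* Static array: lives in read-only data, nothing to initialize at run time. */
unsigned int sketch_pick_static(unsigned int max_supported)
{
	static const unsigned int ver_list[] = { 1, 2, 4, 5 };
	unsigned int best = 0;
	size_t i;

	for (i = 0; i < sizeof(ver_list) / sizeof(ver_list[0]); i++)
		if (ver_list[i] <= max_supported)
			best = ver_list[i];
	return best;
}
```

Both functions return the same result for every input; only where the table is stored differs.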


* Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob
From: David Miller @ 2017-09-25 23:24 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, edumazet, weiwan, lucab, netdev, linux-kernel
In-Reply-To: <1506088124-12650-2-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Fri, 22 Sep 2017 21:48:43 +0800

> @@ -9,13 +9,18 @@
>  #include <net/inetpeer.h>
>  #include <net/tcp.h>
>  
> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
> -
> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
> -
> -void tcp_fastopen_init_key_once(bool publish)
> +void tcp_fastopen_init_key_once(struct net *net)

Why did you remove the 'publish' logic from this function?


* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Cong Wang @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925190031.GB1936@nanopsycho.orion>

On Mon, Sep 25, 2017 at 12:00 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Interesting, any idea why this is not 0x7FFFFFFF as well?
>
> I wonder if we could have 0x7FFFFFFF magic defined somewhere.

I have no idea; it has just been that way for a rather long time. It is probably too late
to change, or at least it would require a separate patch.


* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <59C97244.30301@iogearbox.net>

On Mon, Sep 25, 2017 at 2:16 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 09/25/2017 07:13 PM, Cong Wang wrote:
>>         ret = cls_bpf_offload(tp, prog, oldprog);
>>         if (ret) {
>> +               if (!oldprog)
>> +                       idr_remove_ext(&head->handle_idr, prog->handle);
>
>
> Shouldn't we also call idr_remove_ext() when there was an
> oldprog, but we didn't care about reusing the same handle,
> so it was handle == 0 initially?

When oldprog is non-NULL, we are replacing the oldprog with
a new one, therefore we should call idr_replace_ext() which
happens after this. So no need to call idr_remove_ext() at
this point.



>
> There's this condition in the code before above idr allocations,
> I think also in other classifiers:
>
>         if (oldprog) {
>                 if (handle && oldprog->handle != handle) {
>                         ret = -EINVAL;
>                         goto errout;
>                 }
>         }

Sure. If we use handle to find oldprog, it should have the
same handle. cls_bpf_get() guarantees it. This check is
redundant.

>
>>                 __cls_bpf_delete_prog(prog);
>>                 return ret;
>>         }
>> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct
>> sk_buff *in_skb,
>>                 prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>>
>>         if (oldprog) {
>> +               idr_replace_ext(&head->handle_idr, prog, handle);
>
>
> And here, we should probably use prog->handle for the above
> mentioned case as well, no?

Since we are replacing oldprog with a new one, prog->handle is the
same as handle.


>
> Would be great if all this (and e.g. the fact that we use idr itself)
> could optionally be hidden behind some handle generator api given
> we could reuse that api also for cls_basic and cls_u32. Could also
> be followed-up perhaps.
>

Yeah, the idr_alloc_ext(.., handle, handle+1,) is ugly. Ideally we should
specify the range during initialization rather than in each idr_alloc_ext().
Commit c15ab236d69d already did the same thing. We can refactor
this later.
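
To make the "range of width one" idiom concrete, here is a toy ID allocator in user space. It is only a sketch: the sketch_ids type and its functions are hypothetical and far simpler than the kernel idr, but they show an exact-handle request expressed as the range [handle, handle + 1), with the default range fixed once at initialization.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_MAX_IDS 64

struct sketch_ids {
	unsigned char used[SKETCH_MAX_IDS];
	int lo, hi;	/* default range [lo, hi), set once at init time */
};

void sketch_ids_init(struct sketch_ids *ids, int lo, int hi)
{
	assert(ids && 0 <= lo && lo < hi && hi <= SKETCH_MAX_IDS);
	memset(ids->used, 0, sizeof(ids->used));
	ids->lo = lo;
	ids->hi = hi;
}

/* Lowest free id in [start, end), clamped to the default range, or -ENOSPC. */
int sketch_ids_alloc_range(struct sketch_ids *ids, int start, int end)
{
	int id;

	if (start < ids->lo)
		start = ids->lo;
	if (end > ids->hi)
		end = ids->hi;
	for (id = start; id < end; id++) {
		if (!ids->used[id]) {
			ids->used[id] = 1;
			return id;
		}
	}
	return -ENOSPC;
}

/* A request for one specific handle is a range of width one. */
int sketch_ids_alloc_exact(struct sketch_ids *ids, int handle)
{
	return sketch_ids_alloc_range(ids, handle, handle + 1);
}

/* No handle requested: take the lowest free id in the default range. */
int sketch_ids_alloc_any(struct sketch_ids *ids)
{
	return sketch_ids_alloc_range(ids, ids->lo, ids->hi);
}
```

Keeping the default range inside the allocator means callers only spell out a range in the exact-handle case.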


* [PATCH net] net: dsa: Fix network device registration order
From: Florian Fainelli @ 2017-09-25 22:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli, Andrew Lunn, Vivien Didelot, open list

We cannot be registering the network device first, then setting its
carrier off and finally connecting it to a PHY. Doing that leaves a
window during which the carrier is at best inconsistent, and at worst
the device is not usable without a down/up sequence, since the network
device is visible to user space with possibly no PHY device attached.

Re-order steps so that they make logical sense. This fixes some devices
where the port was not usable after, e.g., an unbind then bind of the
driver.

Fixes: 0071f56e46da ("dsa: Register netdev before phy")
Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/slave.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)
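
The ordering and the goto-based unwinding can be modelled in user space. This is a sketch only: sketch_port and its helpers are hypothetical and merely record which steps ran, and the fail_* parameters exist just to exercise the error paths.

```c
#include <assert.h>

/* Hypothetical stand-in for a slave port; these fields do not exist in the kernel. */
struct sketch_port {
	int carrier_on;
	int phy_attached;
	int registered;		/* set once the device is visible to user space */
	int released;		/* set once resources have been freed */
};

static int sketch_phy_setup(struct sketch_port *p, int fail)
{
	if (fail)
		return -1;
	p->phy_attached = 1;
	return 0;
}

static int sketch_register(struct sketch_port *p, int fail)
{
	if (fail)
		return -1;
	/* The invariant the reordering is after: never visible without a PHY. */
	assert(p->phy_attached);
	p->registered = 1;
	return 0;
}

int sketch_port_create(struct sketch_port *p, int fail_phy, int fail_register)
{
	int ret;

	p->carrier_on = 0;		/* carrier off before anything is visible */

	ret = sketch_phy_setup(p, fail_phy);
	if (ret)
		goto out_free;

	ret = sketch_register(p, fail_register);	/* registration comes last */
	if (ret)
		goto out_phy;

	return 0;

out_phy:
	p->phy_attached = 0;		/* undo the PHY connection */
out_free:
	p->released = 1;		/* free what was allocated earlier */
	return ret;
}
```

Each label undoes exactly the steps that completed before the failure, in reverse order, which is why registration failure falls through both labels while PHY failure only reaches the second.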

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2afa99506f8b..865e29e62bad 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1301,28 +1301,33 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
 	p->old_duplex = -1;
 
 	port->netdev = slave_dev;
-	ret = register_netdev(slave_dev);
-	if (ret) {
-		netdev_err(master, "error %d registering interface %s\n",
-			   ret, slave_dev->name);
-		port->netdev = NULL;
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
-	}
 
 	netif_carrier_off(slave_dev);
 
 	ret = dsa_slave_phy_setup(p, slave_dev);
 	if (ret) {
 		netdev_err(master, "error %d setting up slave phy\n", ret);
-		unregister_netdev(slave_dev);
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
+		goto out_free;
+	}
+
+	ret = register_netdev(slave_dev);
+	if (ret) {
+		netdev_err(master, "error %d registering interface %s\n",
+			   ret, slave_dev->name);
+		goto out_phy;
 	}
 
 	return 0;
+
+out_phy:
+	phy_disconnect(p->phy);
+	if (of_phy_is_fixed_link(p->dp->dn))
+		of_phy_deregister_fixed_link(p->dp->dn);
+out_free:
+	free_percpu(p->stats64);
+	free_netdev(slave_dev);
+	port->netdev = NULL;
+	return ret;
 }
 
 void dsa_slave_destroy(struct net_device *slave_dev)
-- 
2.9.3


* [jkirsher/next-queue PATCH] ixgbe: Update adaptive ITR algorithm
From: Alexander Duyck @ 2017-09-25 21:55 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: john.fastabend, brouer

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change updates the adaptive ITR algorithm to better support the
needs of the network. Specifically, the ITR algorithm now tries to avoid
either starving a socket buffer for memory in the case of Tx, or
overrunning an Rx socket buffer on receive.

In addition, a side effect of the calculations used is that we should
function better with new features such as XDP which can handle small
packets at high rates without needing to lock us into NAPI polling mode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

I am sending this to a wider distribution list than normal for a patch
like this in order to get feedback on any areas I may have overlooked.
This patch should address many of the performance limitations seen with
pktgen and XDP on workloads that the old adaptive scheme wasn't handling.
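
The size-based step of the calculation can be tried out in user space. The following is a minimal sketch, not the driver code: the function name sketch_itr_usecs and its two flag parameters are made up for illustration, while the thresholds and multipliers are copied from the diff below. It covers only the adjust_by_size portion, not the packet-count buckets that precede it.

```c
#include <assert.h>

#define SKETCH_MIN_INC 2	/* mirrors IXGBE_ITR_ADAPTIVE_MIN_INC in the patch */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: map an average packet size in bytes to an interrupt
 * delay in microseconds using the piecewise-linear steps from the patch.
 *
 * low_latency: non-zero halves the delay (the latency flag is set).
 * fast_link:   non-zero uses the divisor of the first switch arm in the
 *              patch (the one that includes 10G), zero uses the divisor of
 *              the second arm (the one that includes 1G).
 */
unsigned int sketch_itr_usecs(unsigned int avg_wire_size, int low_latency,
			      int fast_link)
{
	unsigned int scaled;

	if (avg_wire_size <= 60)
		scaled = 5120;
	else if (avg_wire_size <= 316)
		scaled = avg_wire_size * 40 + 2720;
	else if (avg_wire_size <= 1084)
		scaled = avg_wire_size * 15 + 11452;
	else if (avg_wire_size <= 1980)
		scaled = avg_wire_size * 5 + 22420;
	else
		scaled = 32256;

	if (low_latency)
		scaled >>= 1;

	/* The scaled value carries a fixed-point factor that the divisor removes. */
	return SKETCH_DIV_ROUND_UP(scaled,
				   SKETCH_MIN_INC * (fast_link ? 256 : 64)) *
	       SKETCH_MIN_INC;
}
```

With these constants, a 60-byte average on the fast-link arm gives 20 usecs (about 50K interrupts per second), and an average above 1980 bytes gives 126 usecs (about 8K interrupts per second), matching the rates named in the patch comments. Setting the latency flag at 60 bytes halves the delay to 10 usecs.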

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    7 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |   11 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  215 +++++++++++++++++++------
 3 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 56039d04b38d..555eb80d8a08 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -435,8 +435,15 @@ static inline unsigned int ixgbe_rx_pg_order(struct ixgbe_ring *ring)
 }
 #define ixgbe_rx_pg_size(_ring) (PAGE_SIZE << ixgbe_rx_pg_order(_ring))
 
+#define IXGBE_ITR_ADAPTIVE_MIN_INC	2
+#define IXGBE_ITR_ADAPTIVE_MIN_USECS	10
+#define IXGBE_ITR_ADAPTIVE_MAX_USECS	126
+#define IXGBE_ITR_ADAPTIVE_LATENCY	0x80
+#define IXGBE_ITR_ADAPTIVE_BULK		0x00
+
 struct ixgbe_ring_container {
 	struct ixgbe_ring *ring;	/* pointer to linked list of rings */
+	unsigned long next_update;	/* jiffies value of last update */
 	unsigned int total_bytes;	/* total bytes processed this int */
 	unsigned int total_packets;	/* total packets processed this int */
 	u16 work_limit;			/* total work allowed per interrupt */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f1bfae0c41d0..8e2a957aca18 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -806,6 +806,7 @@ static void ixgbe_add_ring(struct ixgbe_ring *ring,
 	ring->next = head->ring;
 	head->ring = ring;
 	head->count++;
+	head->next_update = jiffies + 1;
 }
 
 /**
@@ -879,8 +880,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* initialize work limits */
 	q_vector->tx.work_limit = adapter->tx_work_limit;
 
-	/* initialize pointer to rings */
-	ring = q_vector->ring;
+	/* Initialize setting for adaptive ITR */
+	q_vector->tx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	q_vector->rx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
 
 	/* intialize ITR */
 	if (txr_count && !rxr_count) {
@@ -897,6 +901,9 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 			q_vector->itr = adapter->rx_itr_setting;
 	}
 
+	/* initialize pointer to rings */
+	ring = q_vector->ring;
+
 	while (txr_count) {
 		/* assign generic ring traits */
 		ring->dev = &adapter->pdev->dev;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3d3739f103af..44a96878075b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2517,50 +2517,174 @@ enum latency_range {
 static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
 			     struct ixgbe_ring_container *ring_container)
 {
-	int bytes = ring_container->total_bytes;
-	int packets = ring_container->total_packets;
-	u32 timepassed_us;
-	u64 bytes_perint;
-	u8 itr_setting = ring_container->itr;
+	unsigned int itr = IXGBE_ITR_ADAPTIVE_MIN_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	unsigned int avg_wire_size, packets, bytes;
+	unsigned long next_update = jiffies;
 
-	if (packets == 0)
+	/* If we don't have any rings just leave ourselves set for maximum
+	 * possible latency so we take ourselves out of the equation.
+	 */
+	if (!ring_container->ring)
 		return;
 
-	/* simple throttlerate management
-	 *   0-10MB/s   lowest (100000 ints/s)
-	 *  10-20MB/s   low    (20000 ints/s)
-	 *  20-1249MB/s bulk   (12000 ints/s)
+	/* If we didn't update within up to 1 - 2 jiffies we can assume
+	 * that either packets are coming in so slow there hasn't been
+	 * any work, or that there is so much work that NAPI is dealing
+	 * with interrupt moderation and we don't need to do anything.
 	 */
-	/* what was last interrupt timeslice? */
-	timepassed_us = q_vector->itr >> 2;
-	if (timepassed_us == 0)
-		return;
+	if (time_after(next_update, ring_container->next_update))
+		goto clear_counts;
 
-	bytes_perint = bytes / timepassed_us; /* bytes/usec */
+	packets = ring_container->total_packets;
 
-	switch (itr_setting) {
-	case lowest_latency:
-		if (bytes_perint > 10)
-			itr_setting = low_latency;
-		break;
-	case low_latency:
-		if (bytes_perint > 20)
-			itr_setting = bulk_latency;
-		else if (bytes_perint <= 10)
-			itr_setting = lowest_latency;
+	/* We have no packets to actually measure against. This means
+	 * either one of the other queues on this vector is active or
+	 * we are a Tx queue doing TSO with too high of an interrupt rate.
+	 *
+	 * When this occurs just tick up our delay by the minimum value
+	 * and hope that this extra delay will prevent us from being called
+	 * without any work on our queue.
+	 */
+	if (!packets) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		itr += ring_container->itr & IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto clear_counts;
+	}
+
+	bytes = ring_container->total_bytes;
+
+	/* If packets are less than 4 or bytes are less than 9000 assume
+	 * insufficient data to use bulk rate limiting approach. We are
+	 * likely latency driven.
+	 */
+	if (packets < 4 && bytes < 9000) {
+		itr = IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto adjust_by_size;
+	}
+
+	/* Between 4 and 48 we can assume that our current interrupt delay
+	 * is only slightly too low. As such we should increase it by a small
+	 * fixed amount.
+	 */
+	if (packets < 48) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		goto clear_counts;
+	}
+
+	/* Between 48 and 96 is our "goldilocks" zone where we are working
+	 * out "just right". Just report that our current ITR is good for us.
+	 */
+	if (packets < 96) {
+		itr = q_vector->itr >> 2;
+		goto clear_counts;
+	}
+
+	/* If packet count is 96 or greater we are likely looking at a slight
+	 * overrun of the delay we want. Try halving our delay to see if that
+	 * will cut the number of packets in half per interrupt.
+	 */
+	if (packets < 256) {
+		itr = q_vector->itr >> 3;
+		if (itr < IXGBE_ITR_ADAPTIVE_MIN_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MIN_USECS;
+		goto clear_counts;
+	}
+
+	/* The paths below assume we are dealing with a bulk ITR since number
+	 * of packets is 256 or greater. We are just going to have to compute
+	 * a value and try to bring the count under control, though for smaller
+	 * packet sizes there isn't much we can do as NAPI polling will likely
+	 * be kicking in sooner rather than later.
+	 */
+	itr = IXGBE_ITR_ADAPTIVE_BULK;
+
+adjust_by_size:
+	/* If packet counts are 256 or greater we can assume we have a gross
+	 * overestimation of what the rate should be. Instead of trying to fine
+	 * tune it just use the formula below to try and dial in an exact value
+	 * give the current packet size of the frame.
+	 */
+	avg_wire_size = bytes / packets;
+
+	/* The following is a crude approximation of:
+	 *  wmem_default / (size + overhead) = desired_pkts_per_int
+	 *  rate / bits_per_byte / (size + ethernet overhead) = pkt_rate
+	 *  (desired_pkt_rate / pkt_rate) * usecs_per_sec = ITR value
+	 *
+	 * Assuming wmem_default is 212992 and overhead is 640 bytes per
+	 * packet, (256 skb, 64 headroom, 320 shared info), we can reduce the
+	 * formula down to
+	 *
+	 *  (170 * (size + 24)) / (size + 640) = ITR
+	 *
+	 * We first do some math on the packet size and then finally bitshift
+	 * by 8 after rounding up. We also have to account for PCIe link speed
+	 * difference as ITR scales based on this.
+	 */
+	if (avg_wire_size <= 60) {
+		/* Start at 50k ints/sec */
+		avg_wire_size = 5120;
+	} else if (avg_wire_size <= 316) {
+		/* 50K ints/sec to 16K ints/sec */
+		avg_wire_size *= 40;
+		avg_wire_size += 2720;
+	} else if (avg_wire_size <= 1084) {
+		/* 16K ints/sec to 9.2K ints/sec */
+		avg_wire_size *= 15;
+		avg_wire_size += 11452;
+	} else if (avg_wire_size <= 1980) {
+		/* 9.2K ints/sec to 8K ints/sec */
+		avg_wire_size *= 5;
+		avg_wire_size += 22420;
+	} else {
+		/* plateau at a limit of 8K ints/sec */
+		avg_wire_size = 32256;
+	}
+
+	/* If we are in low latency mode half our delay which doubles the rate
+	 * to somewhere between 100K to 16K ints/sec
+	 */
+	if (itr & IXGBE_ITR_ADAPTIVE_LATENCY)
+		avg_wire_size >>= 1;
+
+	/* Resultant value is 256 times larger than it needs to be. This
+	 * gives us room to adjust the value as needed to either increase
+	 * or decrease the value based on link speeds of 10G, 2.5G, 1G, etc.
+	 *
+	 * Use addition as we have already recorded the new latency flag
+	 * for the ITR value.
+	 */
+	switch (q_vector->adapter->link_speed) {
+	case IXGBE_LINK_SPEED_10GB_FULL:
+	case IXGBE_LINK_SPEED_100_FULL:
+	default:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 256) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
-	case bulk_latency:
-		if (bytes_perint <= 20)
-			itr_setting = low_latency;
+	case IXGBE_LINK_SPEED_2_5GB_FULL:
+	case IXGBE_LINK_SPEED_1GB_FULL:
+	case IXGBE_LINK_SPEED_10_FULL:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 64) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
 	}
 
-	/* clear work counters since we have the values we need */
+clear_counts:
+	/* write back value */
+	ring_container->itr = itr;
+
+	/* next update should occur within next jiffy */
+	ring_container->next_update = next_update + 1;
+
 	ring_container->total_bytes = 0;
 	ring_container->total_packets = 0;
-
-	/* write updated itr to ring container */
-	ring_container->itr = itr_setting;
 }
 
 /**
@@ -2602,34 +2726,19 @@ void ixgbe_write_eitr(struct ixgbe_q_vector *q_vector)
 
 static void ixgbe_set_itr(struct ixgbe_q_vector *q_vector)
 {
-	u32 new_itr = q_vector->itr;
-	u8 current_itr;
+	u32 new_itr;
 
 	ixgbe_update_itr(q_vector, &q_vector->tx);
 	ixgbe_update_itr(q_vector, &q_vector->rx);
 
-	current_itr = max(q_vector->rx.itr, q_vector->tx.itr);
+	/* use the smallest value of new ITR delay calculations */
+	new_itr = min(q_vector->rx.itr, q_vector->tx.itr);
 
-	switch (current_itr) {
-	/* counts and packets in update_itr are dependent on these numbers */
-	case lowest_latency:
-		new_itr = IXGBE_100K_ITR;
-		break;
-	case low_latency:
-		new_itr = IXGBE_20K_ITR;
-		break;
-	case bulk_latency:
-		new_itr = IXGBE_12K_ITR;
-		break;
-	default:
-		break;
-	}
+	/* Clear latency flag if set, shift into correct position */
+	new_itr &= ~IXGBE_ITR_ADAPTIVE_LATENCY;
+	new_itr <<= 2;
 
 	if (new_itr != q_vector->itr) {
-		/* do an exponential smoothing */
-		new_itr = (10 * new_itr * q_vector->itr) /
-			  ((9 * new_itr) + q_vector->itr);
-
 		/* save the algorithm value here */
 		q_vector->itr = new_itr;
 


* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Ahern @ 2017-09-25 21:52 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <20170925.131101.478664536826772174.davem@davemloft.net>

On 9/25/17 2:11 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 25 Sep 2017 10:14:23 -0600
> 
>> I made a simple program this morning and ran it under perf.
> 
> If possible please submit this for selftests.
> 

It is more of a microbenchmark of options to flush an rbtree than a
self-test. Further, it relies on tools/lib/rbtree.c rather than
lib/rbtree.c. The tools/lib version was imported by Arnaldo in July 2015
and is out of date, though it is good enough to show the intent w.r.t.
flushing options.


* Re: [PATCH net 0/2] l2tp: fix some races in session deletion
From: David Miller @ 2017-09-25 21:45 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin, sd
In-Reply-To: <cover.1506086081.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 22 Sep 2017 15:39:22 +0200

> L2TP provides several interfaces for deleting sessions. Using two of
> them concurrently can lead to use-after-free bugs.
> 
> Patch #2 uses a flag to prevent double removal of L2TP sessions.
> Patch #1 fixes a bug found along the way. Fixing this bug is also
> necessary for patch #2 to handle all cases.
> 
> This issue is similar to the tunnel deletion bug being worked on by
> Sabrina: https://patchwork.ozlabs.org/patch/814173/

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bpf/verifier: improve disassembly of BPF_END instructions
From: Daniel Borkmann @ 2017-09-25 21:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Y Song
  Cc: Edward Cree, David Miller, netdev, Jiong Wang, Jakub Kicinski
In-Reply-To: <20170924055016.w6x5tj6kjxjbocpl@ast-mbp>

On 09/24/2017 07:50 AM, Alexei Starovoitov wrote:
> On Fri, Sep 22, 2017 at 09:49:10PM -0700, Y Song wrote:
>> On Fri, Sep 22, 2017 at 9:23 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 22/09/17 16:16, Alexei Starovoitov wrote:
>>>> looks like we're converging on
>>>> "be16/be32/be64/le16/le32/le64 #register" for BPF_END.
>>>> I guess I can live with that. I would prefer more C like syntax
>>>> to match the rest, but llvm parsing point is a strong one.
>>> Yep, agreed.  I'll post a v2 once we've settled BPF_NEG.
>>>> For BPF_NEG I prefer to do it in C syntax like interpreter does:
>>>>          ALU_NEG:
>>>>                  DST = (u32) -DST;
>>>>          ALU64_NEG:
>>>>                  DST = -DST;
>>>> Yonghong, does it mean that asmparser will equally suffer?
>>> Correction to my earlier statements: verifier will currently disassemble
>>>   neg as:
>>> (87) r0 neg 0
>>> (84) (u32) r0 neg (u32) 0
>>>   because it pretends 'neg' is a compound-assignment operator like +=.
>>> The analogy with be16 and friends would be to use
>>>      neg64 r0
>>>      neg32 r0
>>>   whereas the analogy with everything else would be
>>>      r0 = -r0
>>>      r0 = (u32) -r0
>>>   as Alexei says.
>>> I'm happy to go with Alexei's version if it doesn't cause problems for llvm.
>>
>> I got some time to do some prototyping in llvm and it looks like I am
>> able to resolve the issue and we are able to use more C-like syntax.
>> That is:
>> for bswap:
>>       r1 = (be16) (u16) r1
>>       or
>>       r1 = (be16) r1
>>       or
>>       r1 = be16 r1
>> for neg:
>>       r0 = -r0
>>       (for 32bit support, llvm may output "w0 = -w0" in the future. But
>>        since it is not enabled yet, you can continue to output
>>        "r0 = (u32) -r0".)
>>
>> Not sure which syntax is best for bswap. The "r1 = (be16) (u16) r1" is most
>> explicit in its intention.
>>
>> Attaching my llvm patch as well and cc'ing Jiong and Jakub so they can see my
>> implementation and the related discussion here. (In this patch, I did
>> not implement bswap for little endian yet.) Maybe they can provide
>> additional comments.
>
> This is awesome. In such case I'd like to swing back to the C syntax for bpf_end :)
> Any of these
>    r1 = (be16) (u16) r1
>    or
>    r1 = (be16) r1
>    or
>    r1 = be16 r1
> are better than just
>    be16 r1
> I like 1st the most, since it's explicit in terms of what happens with upper bits,
> but 2nd is also ok. 3rd is not quite C-like.

But above cast to be16 also doesn't seem quite C-like in terms
of what we're actually doing... 3rd option would be my personal
preference even if it doesn't look C-like, but otoh we also have
'call' etc which is neither.
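
To make the three candidate spellings concrete, here is a small user-space
sketch that formats a byte-swap instruction each way. It is not the verifier
code; the enum and the function name are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names for the three spellings discussed in this thread. */
enum bswap_style {
	STYLE_CAST_EXPLICIT,	/* r1 = (be16) (u16) r1 */
	STYLE_CAST,		/* r1 = (be16) r1 */
	STYLE_KEYWORD,		/* r1 = be16 r1 */
};

/*
 * Format a BPF_END instruction acting on register 'reg' into 'buf'.
 * 'big' selects be vs le, 'bits' must be 16, 32 or 64.
 * Returns whatever snprintf() returns.
 */
static int format_bswap(char *buf, size_t len, enum bswap_style style,
			int big, int bits, int reg)
{
	const char *e = big ? "be" : "le";

	assert(bits == 16 || bits == 32 || bits == 64);

	switch (style) {
	case STYLE_CAST_EXPLICIT:
		return snprintf(buf, len, "r%d = (%s%d) (u%d) r%d",
				reg, e, bits, bits, reg);
	case STYLE_CAST:
		return snprintf(buf, len, "r%d = (%s%d) r%d",
				reg, e, bits, reg);
	default:
		return snprintf(buf, len, "r%d = %s%d r%d",
				reg, e, bits, reg);
	}
}
```

Only the first style spells out what happens to the upper bits, which is the
trade-off being weighed above.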

^ permalink raw reply

* Re: [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Vivien Didelot @ 2017-09-25 21:34 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506375140-2853-1-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> Ports with the same VLAN must all be in the same bridge. However the
> CPU and DSA ports need to be in multiple VLANs spread over multiple
> bridges. So exclude them when performing this test.
>
> Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
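
For readers not familiar with the check, the rule from the commit message can
be sketched in plain user-space C as below. The types and the function name
are hypothetical, and skipping CPU/DSA ports among the existing VLAN members
is an assumption of this sketch rather than something stated in the patch.

```c
#include <assert.h>
#include <errno.h>

enum port_type { PORT_USER, PORT_CPU, PORT_DSA };

struct port {
	enum port_type type;
	int bridge;		/* bridge id, or -1 when not bridged */
};

/* One VLAN table entry: a VID plus a bitmask of member ports. */
struct vlan_entry {
	int vid;
	unsigned int members;
};

/*
 * Return 0 if 'port' may join 'vid', -EOPNOTSUPP if another user port in a
 * different bridge is already a member of that VLAN.
 */
static int port_check_vlan(const struct port *ports, int nports,
			   const struct vlan_entry *vlans, int nvlans,
			   int port, int vid)
{
	int v, i;

	assert(port >= 0 && port < nports);

	/* CPU and DSA ports may belong to VLANs of several bridges. */
	if (ports[port].type != PORT_USER)
		return 0;

	for (v = 0; v < nvlans; v++) {
		if (vlans[v].vid != vid)
			continue;
		for (i = 0; i < nports; i++) {
			if (i == port || !(vlans[v].members & (1u << i)))
				continue;
			/* Assumption: shared ports never cause a conflict. */
			if (ports[i].type != PORT_USER)
				continue;
			if (ports[i].bridge != ports[port].bridge)
				return -EOPNOTSUPP;
		}
	}
	return 0;
}
```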

^ permalink raw reply

* Re: [PATCH net-next v2] bpf: Optimize lpm trie delete
From: David Miller @ 2017-09-25 21:38 UTC (permalink / raw)
  To: kraigatgoog; +Cc: daniel, ast, daniel, netdev
In-Reply-To: <20170921224329.101928-1-kraigatgoog@gmail.com>

From: Craig Gallek <kraigatgoog@gmail.com>
Date: Thu, 21 Sep 2017 18:43:29 -0400

> From: Craig Gallek <kraig@google.com>
> 
> Before the delete operator was added, this datastructure maintained
> an invariant that intermediate nodes were only present when necessary
> to build the tree.  This patch updates the delete operation to reinstate
> that invariant by removing unnecessary intermediate nodes after a node is
> removed and thus keeping the tree structure at a minimal size.
> 
> Suggested-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Craig Gallek <kraig@google.com>

Applied, thank you.
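
The invariant described in the commit message can be illustrated with a
simplified, single-threaded sketch. This is not the kernel implementation (no
RCU, no keys or prefix lengths); the structure and function names are
hypothetical, and the caller is assumed to have already located the node and
its parent.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	struct node *child[2];
	int intermediate;	/* 1: exists only as a branch point */
};

static struct node *node_new(int intermediate)
{
	struct node *n = calloc(1, sizeof(*n));

	assert(n);
	n->intermediate = intermediate;
	return n;
}

/*
 * Delete the node referenced by *slot. 'parent_slot' is the slot that holds
 * the node's parent, or NULL when the node is the root.
 */
static void trie_unlink(struct node **parent_slot, struct node **slot)
{
	struct node *node = *slot;
	struct node *parent = parent_slot ? *parent_slot : NULL;

	assert(node);

	if (node->child[0] && node->child[1]) {
		/* Still needed to join two subtrees: keep as intermediate. */
		node->intermediate = 1;
		return;
	}

	if (node->child[0] || node->child[1]) {
		/* Exactly one child: splice the child into the node's place. */
		*slot = node->child[0] ? node->child[0] : node->child[1];
		free(node);
		return;
	}

	/* Leaf: drop it. */
	*slot = NULL;
	free(node);

	if (parent && parent->intermediate) {
		/* The parent now joins nothing; replace it with the sibling. */
		*parent_slot = parent->child[0] ? parent->child[0]
						: parent->child[1];
		free(parent);
	}
}
```

After a leaf below an intermediate node is removed, the sibling takes the
intermediate node's place, so no branch point with a single child is left
behind.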

^ permalink raw reply

* [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Andrew Lunn @ 2017-09-25 21:32 UTC (permalink / raw)
  To: David Miller; +Cc: Vivien Didelot, netdev, Andrew Lunn

Ports with the same VLAN must all be in the same bridge. However the
CPU and DSA ports need to be in multiple VLANs spread over multiple
bridges. So exclude them when performing this test.

Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c6678aa9b4ef..674dab71d71c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1100,6 +1100,10 @@ static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch *ds, int port,
 	};
 	int i, err;
 
+	/* DSA and CPU ports have to be members of multiple vlans */
+	if (dsa_is_dsa_port(ds, port) || dsa_is_cpu_port(ds, port))
+		return 0;
+
 	if (!vid_begin)
 		return -EOPNOTSUPP;
 
-- 
2.14.1

^ permalink raw reply related

* WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
From: Richard Weinberger @ 2017-09-25 21:23 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linux-kernel

Hi!

While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to trigger 
this splat:

[  297.629773] WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
[  297.629777] unwind stack type:0 next_sp:          (null) mask:0x6 graph_idx:0
[  297.629783] ffff88015b207ae0: ffff88015b207b68 (0xffff88015b207b68)
[  297.629790] ffff88015b207ae8: ffffffffb163c00e (__save_stack_trace+0x6e/0xd0)
[  297.629792] ffff88015b207af0: 0000000000000000 ...
[  297.629795] ffff88015b207af8: ffff880156a58000 (0xffff880156a58000)
[  297.629799] ffff88015b207b00: ffff880156a60000 (0xffff880156a60000)
[  297.629800] ffff88015b207b08: 0000000000000000 ...
[  297.629803] ffff88015b207b10: 0000000000000006 (0x6)
[  297.629806] ffff88015b207b18: ffff880151b02700 (0xffff880151b02700)
[  297.629809] ffff88015b207b20: 0000010100000000 (0x10100000000)
[  297.629812] ffff88015b207b28: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.629815] ffff88015b207b30: ffff88015b207ae0 (0xffff88015b207ae0)
[  297.629818] ffff88015b207b38: ffffffffc0050282 (0xffffffffc0050282)
[  297.629819] ffff88015b207b40: 0000000000000000 ...
[  297.629822] ffff88015b207b48: 0000000001000000 (0x1000000)
[  297.629825] ffff88015b207b50: ffff880157b98280 (0xffff880157b98280)
[  297.629828] ffff88015b207b58: ffff880157b98380 (0xffff880157b98380)
[  297.629831] ffff88015b207b60: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.629834] ffff88015b207b68: ffff88015b207b78 (0xffff88015b207b78)
[  297.629838] ffff88015b207b70: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629841] ffff88015b207b78: ffff88015b207da8 (0xffff88015b207da8)
[  297.629847] ffff88015b207b80: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629850] ffff88015b207b88: 000000400000000c (0x400000000c)
[  297.629852] ffff88015b207b90: ffff88015b207ba0 (0xffff88015b207ba0)
[  297.629855] ffff88015b207b98: ffff880100000000 (0xffff880100000000)
[  297.629859] ffff88015b207ba0: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629864] ffff88015b207ba8: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629868] ffff88015b207bb0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.629873] ffff88015b207bb8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.629879] ffff88015b207bc0: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.629886] ffff88015b207bc8: ffffffffb172580c (rcu_process_callbacks+0x2dc/0xcd0)
[  297.629892] ffff88015b207bd0: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.629897] ffff88015b207bd8: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.629902] ffff88015b207be0: ffffffffb2646446 (smp_apic_timer_interrupt+0x86/0x1a0)
[  297.629907] ffff88015b207be8: ffffffffb26452f3 (apic_timer_interrupt+0x93/0xa0)
[  297.629913] ffff88015b207bf0: ffffffffb1667417 (optimized_callback+0x67/0x100)
[  297.629916] ffff88015b207bf8: ffffffffc0050282 (0xffffffffc0050282)
[  297.629918] ffff88015b207c00: 0000000000000000 ...
[  297.629921] ffff88015b207c08: ffff88015a77e24c (0xffff88015a77e24c)
[  297.629924] ffff88015b207c10: ffff88015b207c38 (0xffff88015b207c38)
[  297.629927] ffff88015b207c18: ffff88015b207c38 (0xffff88015b207c38)
[  297.629929] ffff88015b207c20: 0000000000000086 (0x86)
[  297.629932] ffff88015b207c28: ffff88015a77db00 (0xffff88015a77db00)
[  297.629935] ffff88015b207c30: 1ffff1002b640f91 (0x1ffff1002b640f91)
[  297.629938] ffff88015b207c38: ffff88015b207d10 (0xffff88015b207d10)
[  297.629945] ffff88015b207c40: ffffffffb16c9f60 (try_to_wake_up+0xb0/0x710)
[  297.629947] ffff88015b207c48: 0000000000000000 ...
[  297.629952] ffff88015b207c50: ffffffffb2dfd3c0 (machine_ops+0x40/0x40)
[  297.629954] ffff88015b207c58: ffff88015a77df94 (0xffff88015a77df94)
[  297.629957] ffff88015b207c60: 0000000000023540 (0x23540)
[  297.629960] ffff88015b207c68: ffff88015b215c38 (0xffff88015b215c38)
[  297.629963] ffff88015b207c70: ffff88015b200000 (0xffff88015b200000)
[  297.629965] ffff88015b207c78: 0000000000000086 (0x86)
[  297.629968] ffff88015b207c80: 0000000100000000 (0x100000000)
[  297.629971] ffff88015b207c88: 0000000041b58ab3 (0x41b58ab3)
[  297.629975] ffff88015b207c90: ffffffffb2d919f2 (.LC2+0x6e0e/0x83b5)
[  297.629981] ffff88015b207c98: ffffffffb16c9eb0 (migrate_swap_stop+0x2e0/0x2e0)
[  297.629986] ffff88015b207ca0: ffffffffb16d0f73 (account_entity_dequeue+0x73/0x110)
[  297.629989] ffff88015b207ca8: 0000000000100000 (0x100000)
[  297.629992] ffff88015b207cb0: ffff88015b2235a0 (0xffff88015b2235a0)
[  297.629994] ffff88015b207cb8: ffff88015061e280 (0xffff88015061e280)
[  297.629997] ffff88015b207cc0: ffff88015b207ce8 (0xffff88015b207ce8)
[  297.630003] ffff88015b207cc8: ffffffffb16c87ed (sched_avg_update+0x2d/0x90)
[  297.630005] ffff88015b207cd0: 0000000000000005 (0x5)
[  297.630008] ffff88015b207cd8: ffff88015b223570 (0xffff88015b223570)
[  297.630010] ffff88015b207ce0: 00000000000000dd (0xdd)
[  297.630013] ffff88015b207ce8: ffff88015a017ea0 (0xffff88015a017ea0)
[  297.630021] ffff88015b207cf0: ffffffffb30b7128 (rcu_sched_state+0x928/0xaa0)
[  297.630024] ffff88015b207cf8: ffff880151b02700 (0xffff880151b02700)
[  297.630026] ffff88015b207d00: 0000000000000001 (0x1)
[  297.630031] ffff88015b207d08: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630034] ffff88015b207d10: ffff88015b207d20 (0xffff88015b207d20)
[  297.630040] ffff88015b207d18: ffffffffb16ca5d0 (wake_up_process+0x10/0x20)
[  297.630043] ffff88015b207d20: ffff88015b207d48 (0xffff88015b207d48)
[  297.630045] ffff88015b207d28: ffff88015b207d48 (0xffff88015b207d48)
[  297.630048] ffff88015b207d30: 0000000000000202 (0x202)
[  297.630053] ffff88015b207d38: ffffffffb30b7120 (rcu_sched_state+0x920/0xaa0)
[  297.630056] ffff88015b207d40: 0000000000000202 (0x202)
[  297.630059] ffff88015b207d48: ffff88015b207d68 (0xffff88015b207d68)
[  297.630063] ffff88015b207d50: ffffffffb16ee225 (swake_up+0x25/0x30)
[  297.630069] ffff88015b207d58: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630072] ffff88015b207d60: ffff88015a77db00 (0xffff88015a77db00)
[  297.630074] ffff88015b207d68: ffff88015b207d90 (0xffff88015b207d90)
[  297.630079] ffff88015b207d70: ffffffffb1720016 (rcu_gp_kthread_wake+0x56/0x60)
[  297.630082] ffff88015b207d78: 0000000000000002 (0x2)
[  297.630087] ffff88015b207d80: ffffffffb30b7138 (rcu_sched_state+0x938/0xaa0)
[  297.630092] ffff88015b207d88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630095] ffff88015b207d90: ffff88015b207e18 (0xffff88015b207e18)
[  297.630099] ffff88015b207d98: ffffffffb1720521 (rcu_report_qs_rnp+0x2f1/0x310)
[  297.630102] ffff88015b207da0: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.630105] ffff88015b207da8: ffff88015b207dd0 (0xffff88015b207dd0)
[  297.630110] ffff88015b207db0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.630113] ffff88015b207db8: ffff880157b98280 (0xffff880157b98280)
[  297.630116] ffff88015b207dc0: ffffea00055ee600 (0xffffea00055ee600)
[  297.630121] ffff88015b207dc8: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630124] ffff88015b207dd0: ffff88015b207e00 (0xffff88015b207e00)
[  297.630128] ffff88015b207dd8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.630131] ffff88015b207de0: ffff880157b98280 (0xffff880157b98280)
[  297.630135] ffff88015b207de8: ffffffffb18b7e60 (get_max_files+0x10/0x10)
[  297.630141] ffff88015b207df0: ffffffffb30b72a0 (rcu_sched_state+0xaa0/0xaa0)
[  297.630143] ffff88015b207df8: 000000000000000f (0xf)
[  297.630146] ffff88015b207e00: ffff88015b207e18 (0xffff88015b207e18)
[  297.630150] ffff88015b207e08: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630153] ffff88015b207e10: ffff880157b98280 (0xffff880157b98280)
[  297.630156] ffff88015b207e18: ffff88015b207f30 (0xffff88015b207f30)
[  297.630161] ffff88015b207e20: ffffffffb172580c (rcu_process_callbacks+0x2dc/0xcd0)
[  297.630164] ffff88015b207e28: ffff88015b21b000 (0xffff88015b21b000)
[  297.630167] ffff88015b207e30: ffff88015b21b070 (0xffff88015b21b070)
[  297.630170] ffff88015b207e38: 1ffff1002b640fd5 (0x1ffff1002b640fd5)
[  297.630173] ffff88015b207e40: ffff880151b02700 (0xffff880151b02700)
[  297.630176] ffff88015b207e48: ffff88015b224200 (0xffff88015b224200)
[  297.630178] ffff88015b207e50: ffff88015b224280 (0xffff88015b224280)
[  297.630181] ffff88015b207e58: ffff88015b2242b0 (0xffff88015b2242b0)
[  297.630184] ffff88015b207e60: ffff88015b207f08 (0xffff88015b207f08)
[  297.630187] ffff88015b207e68: ffff880151b0274c (0xffff880151b0274c)
[  297.630190] ffff88015b207e70: ffff880151b02700 (0xffff880151b02700)
[  297.630195] ffff88015b207e78: ffffffffb30b7258 (rcu_sched_state+0xa58/0xaa0)
[  297.630198] ffff88015b207e80: ffff880157b98288 (0xffff880157b98288)
[  297.630203] ffff88015b207e88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630206] ffff88015b207e90: ffff88015b224238 (0xffff88015b224238)
[  297.630209] ffff88015b207e98: ffff88015b207ec8 (0xffff88015b207ec8)
[  297.630211] ffff88015b207ea0: 000000000000000a (0xa)
[  297.630214] ffff88015b207ea8: 0000000041b58ab3 (0x41b58ab3)
[  297.630218] ffff88015b207eb0: ffffffffb2d944f5 (.LC0+0x155c/0xa3a6)
[  297.630223] ffff88015b207eb8: ffffffffb1725530 (note_gp_changes+0xe0/0xe0)
[  297.630226] ffff88015b207ec0: ffff88015b215740 (0xffff88015b215740)
[  297.630229] ffff88015b207ec8: ffff880157b983c0 (0xffff880157b983c0)
[  297.630231] ffff88015b207ed0: ffff88014ac19eb0 (0xffff88014ac19eb0)
[  297.630234] ffff88015b207ed8: ffffffffffffffff (0xffffffffffffffff)
[  297.630236] ffff88015b207ee0: 0000000000000000 ...
[  297.630239] ffff88015b207ee8: 0000004552dda1c0 (0x4552dda1c0)
[  297.630240] ffff88015b207ef0: 0000000000000000 ...
[  297.630243] ffff88015b207ef8: ffff88015b207f20 (0xffff88015b207f20)
[  297.630249] ffff88015b207f00: ffffffffb174a0a8 (tick_program_event+0x48/0x80)
[  297.630252] ffff88015b207f08: 0000000000000009 (0x9)
[  297.630259] ffff88015b207f10: ffffffffb3009148 (softirq_vec+0x48/0x80)
[  297.630261] ffff88015b207f18: 0000000000000009 (0x9)
[  297.630263] ffff88015b207f20: 0000000000000008 (0x8)
[  297.630265] ffff88015b207f28: 0000000000000009 (0x9)
[  297.630268] ffff88015b207f30: ffff88015b207fa8 (0xffff88015b207fa8)
[  297.630273] ffff88015b207f38: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.630276] ffff88015b207f40: 0000000a00404100 (0xa00404100)
[  297.630279] ffff88015b207f48: ffff880151b02700 (0xffff880151b02700)
[  297.630282] ffff88015b207f50: 00000000fffff730 (0xfffff730)
[  297.630284] ffff88015b207f58: 0000000000000009 (0x9)
[  297.630286] ffff88015b207f60: 0000000000000040 (0x40)
[  297.630289] ffff88015b207f68: 000001005b21c294 (0x1005b21c294)
[  297.630294] ffff88015b207f70: ffffffffb3009110 (softirq_vec+0x10/0x80)
[  297.630297] ffff88015b207f78: 0000008000000008 (0x8000000008)
[  297.630300] ffff88015b207f80: ffff88015a77ce00 (0xffff88015a77ce00)
[  297.630303] ffff88015b207f88: ffff88015b215840 (0xffff88015b215840)
[  297.630304] ffff88015b207f90: 0000000000000000 ...
[  297.630307] ffff88015b207f98: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630311] ffff88015b207fa0: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630314] ffff88015b207fa8: ffff88015b207fc0 (0xffff88015b207fc0)
[  297.630318] ffff88015b207fb0: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.630321] ffff88015b207fb8: ffff88015b215740 (0xffff88015b215740)
[  297.630324] ffff88015b207fc0: ffff88015b207fe8 (0xffff88015b207fe8)
[  297.630329] ffff88015b207fc8: ffffffffb2646446 (smp_apic_timer_interrupt+0x86/0x1a0)
[  297.630332] ffff88015b207fd0: ffff88015104d500 (0xffff88015104d500)
[  297.630335] ffff88015b207fd8: ffff88015b215840 (0xffff88015b215840)
[  297.630338] ffff88015b207fe0: 0000000000000246 (0x246)
[  297.630341] ffff88015b207fe8: ffff880156a5fdc9 (0xffff880156a5fdc9)
[  297.630345] ffff88015b207ff0: ffffffffb26452f3 (apic_timer_interrupt+0x93/0xa0)
[  297.630348] ffff88015b207ff8: ffff880156a5fdc8 (0xffff880156a5fdc8)
[  297.630352] ffff880156a5fdc8: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630355] ffff880156a5fdd0: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630357] ffff880156a5fdd8: 0000000000000246 (0x246)
[  297.630360] ffff880156a5fde0: ffff88015b215840 (0xffff88015b215840)
[  297.630363] ffff880156a5fde8: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.630366] ffff880156a5fdf0: ffff88015104d500 (0xffff88015104d500)
[  297.630369] ffff880156a5fdf8: fffff52000140c08 (0xfffff52000140c08)
[  297.630372] ffff880156a5fe00: ffffc90000a0603f (0xffffc90000a0603f)
[  297.630375] ffff880156a5fe08: fffff52000140c07 (0xfffff52000140c07)
[  297.630378] ffff880156a5fe10: fffff52000140c08 (0xfffff52000140c08)
[  297.630379] ffff880156a5fe18: 0000000000000000 ...
[  297.630385] ffff880156a5fe20: ffffffffb178d9eb (opt_pre_handler+0x6b/0x80)
[  297.630388] ffff880156a5fe28: dffffc0000000000 (0xdffffc0000000000)
[  297.630391] ffff880156a5fe30: dffffc0000000000 (0xdffffc0000000000)
[  297.630393] ffff880156a5fe38: 0000000000000246 (0x246)
[  297.630396] ffff880156a5fe40: ffffffffffffff10 (0xffffffffffffff10)
[  297.630401] ffff880156a5fe48: ffffffffb1667417 (optimized_callback+0x67/0x100)
[  297.630404] ffff880156a5fe50: 0000000000000010 (0x10)
[  297.630406] ffff880156a5fe58: 0000000000000246 (0x246)
[  297.630409] ffff880156a5fe60: ffff880156a5fe78 (0xffff880156a5fe78)
[  297.630412] ffff880156a5fe68: 0000000000000018 (0x18)
[  297.630414] ffff880156a5fe70: 0000000000000246 (0x246)
[  297.630417] ffff880156a5fe78: 00000000026aed08 (0x26aed08)
[  297.630419] ffff880156a5fe80: 0000000000000005 (0x5)
[  297.630421] ffff880156a5fe88: 0000000000000003 (0x3)
[  297.630423] ffff880156a5fe90: 0000000000000000 ...
[  297.630425] ffff880156a5fe98: 00000000025d1568 (0x25d1568)
[  297.630428] ffff880156a5fea0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630431] ffff880156a5fea8: ffffffffc0050282 (0xffffffffc0050282)
[  297.630433] ffff880156a5feb0: 00000000025d1568 (0x25d1568)
[  297.630435] ffff880156a5feb8: 0000000000000000 ...
[  297.630437] ffff880156a5fec0: 0000000000000003 (0x3)
[  297.630440] ffff880156a5fec8: 0000000000000005 (0x5)
[  297.630442] ffff880156a5fed0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630445] ffff880156a5fed8: 00000000026aed08 (0x26aed08)
[  297.630448] ffff880156a5fee0: ffff880151b02700 (0xffff880151b02700)
[  297.630450] ffff880156a5fee8: 0000000002675e00 (0x2675e00)
[  297.630453] ffff880156a5fef0: 0000000000000001 (0x1)
[  297.630455] ffff880156a5fef8: 0000000000000002 (0x2)
[  297.630457] ffff880156a5ff00: 0000000000000002 (0x2)
[  297.630460] ffff880156a5ff08: 0000000002675e00 (0x2675e00)
[  297.630462] ffff880156a5ff10: 0000000000000180 (0x180)
[  297.630464] ffff880156a5ff18: 0000000000000000 ...
[  297.630466] ffff880156a5ff20: 000000000272a008 (0x272a008)
[  297.630469] ffff880156a5ff28: ffffffffffffffff (0xffffffffffffffff)
[  297.630473] ffff880156a5ff30: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630475] ffff880156a5ff38: 0000000000000010 (0x10)
[  297.630478] ffff880156a5ff40: 0000000000000293 (0x293)
[  297.630481] ffff880156a5ff48: ffff880156a5ff50 (0xffff880156a5ff50)
[  297.630485] ffff880156a5ff50: ffffffffb1665770 (copy_oldmem_page+0x90/0x90)
[  297.630488] ffff880156a5ff58: 00000000025d1b28 (0x25d1b28)
[  297.630489] ffff880156a5ff60: 0000000000000000 ...
[  297.630492] ffff880156a5ff68: 0000000000000003 (0x3)
[  297.630494] ffff880156a5ff70: 0000000000000005 (0x5)
[  297.630497] ffff880156a5ff78: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630499] ffff880156a5ff80: 00000000026aed08 (0x26aed08)
[  297.630502] ffff880156a5ff88: 0000000000000246 (0x246)
[  297.630504] ffff880156a5ff90: 0000000002675e00 (0x2675e00)
[  297.630506] ffff880156a5ff98: 0000000000000001 (0x1)
[  297.630509] ffff880156a5ffa0: 0000000000000002 (0x2)
[  297.630511] ffff880156a5ffa8: ffffffffffffffda (0xffffffffffffffda)
[  297.630514] ffff880156a5ffb0: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630517] ffff880156a5ffb8: 0000000000000180 (0x180)
[  297.630518] ffff880156a5ffc0: 0000000000000000 ...
[  297.630521] ffff880156a5ffc8: 000000000272a008 (0x272a008)
[  297.630523] ffff880156a5ffd0: 0000000000000002 (0x2)
[  297.630526] ffff880156a5ffd8: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630528] ffff880156a5ffe0: 0000000000000033 (0x33)
[  297.630530] ffff880156a5ffe8: 0000000000000246 (0x246)
[  297.630533] ffff880156a5fff0: 00007ffec7d87db8 (0x7ffec7d87db8)
[  297.630535] ffff880156a5fff8: 000000000000002b (0x2b)

opensnoop (python) itself blocks too:

root@test:~# cat /proc/2075/stack
[<ffffffffb79a0a07>] ring_buffer_wait+0x167/0x2e0
[<ffffffffb79a34e7>] wait_on_pipe+0x77/0x80
[<ffffffffb79aa7a1>] tracing_wait_pipe.isra.69+0x51/0xf0
[<ffffffffb79abdf9>] tracing_read_pipe+0x1c9/0x500
[<ffffffffb7ab5e62>] __vfs_read+0xd2/0x370
[<ffffffffb7ab61b7>] vfs_read+0xb7/0x1a0
[<ffffffffb7ab6bd0>] SyS_read+0xa0/0x120
[<ffffffffb8843c37>] entry_SYSCALL_64_fastpath+0x1a/0xa5
[<ffffffffffffffff>] 0xffffffffffffffff

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 21:16 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-1-xiyou.wangcong@gmail.com>

On 09/25/2017 07:13 PM, Cong Wang wrote:
> Instead of calling cls_bpf_get() in a loop to find
> an unused handle, just switch to idr API to allocate
> new handles.
>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Chris Mi <chrism@mellanox.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
[...]
> @@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		}
>   	}
>
> -	if (handle == 0)
> -		prog->handle = cls_bpf_grab_new_handle(tp, head);
> -	else
> +	if (handle == 0) {
> +		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +				    1, 0x7FFFFFFF, GFP_KERNEL);
> +		if (ret)
> +			goto errout;
> +		prog->handle = idr_index;
> +	} else {
> +		if (!oldprog) {
> +			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +					    handle, handle + 1, GFP_KERNEL);
> +			if (ret)
> +				goto errout;
> +		}
>   		prog->handle = handle;
> -	if (prog->handle == 0) {
> -		ret = -EINVAL;
> -		goto errout;
>   	}
>
>   	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
>   	if (ret < 0)
> -		goto errout;
> +		goto errout_idr;
>
>   	ret = cls_bpf_offload(tp, prog, oldprog);
>   	if (ret) {
> +		if (!oldprog)
> +			idr_remove_ext(&head->handle_idr, prog->handle);

Shouldn't we also call idr_remove_ext() when there was an
oldprog, but we didn't care about reusing the same handle,
so it was handle == 0 initially?

There's this condition in the code before the above idr allocations,
and I think in other classifiers as well:

         if (oldprog) {
                 if (handle && oldprog->handle != handle) {
                         ret = -EINVAL;
                         goto errout;
                 }
         }

>   		__cls_bpf_delete_prog(prog);
>   		return ret;
>   	}
> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>
>   	if (oldprog) {
> +		idr_replace_ext(&head->handle_idr, prog, handle);

And here, we should probably use prog->handle for the above
mentioned case as well, no?

Would be great if all this (and e.g. the fact that we use idr itself)
could optionally be hidden behind some handle generator api given
we could reuse that api also for cls_basic and cls_u32. Could also
be followed-up perhaps.

>   		list_replace_rcu(&oldprog->link, &prog->link);
>   		tcf_unbind_filter(tp, &oldprog->res);
>   		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
> @@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   	*arg = prog;
>   	return 0;
>
> +errout_idr:
> +	if (!oldprog)
> +		idr_remove_ext(&head->handle_idr, prog->handle);

(Likewise as the failing cls_bpf_offload().)

>   errout:
>   	tcf_exts_destroy(&prog->exts);
>   	kfree(prog);
>
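
To illustrate the cleanup rule raised in this review, here is a user-space
sketch with a tiny array standing in for the idr. All names are hypothetical
and this is not the cls_bpf code. The point is to remember whether this call
allocated a handle and to release it on any later failure, rather than keying
the cleanup on whether an old program exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_HANDLE 31

struct handle_table {
	void *slot[MAX_HANDLE + 1];	/* slot[0] is never used */
};

/* Claim the first free handle in [lo, hi]; return it or -ENOSPC. */
static int handle_alloc_range(struct handle_table *t, void *obj,
			      int lo, int hi)
{
	int h;

	assert(lo >= 1);

	for (h = lo; h <= hi && h <= MAX_HANDLE; h++) {
		if (!t->slot[h]) {
			t->slot[h] = obj;
			return h;
		}
	}
	return -ENOSPC;
}

static void handle_remove(struct handle_table *t, int h)
{
	t->slot[h] = NULL;
}

/*
 * Model of the allocation and error path only. 'later_step_fails' stands in
 * for a failing parameter-setup or offload step. Returns the handle in use
 * or a negative errno.
 */
static int change_sketch(struct handle_table *t, void *prog,
			 const void *oldprog, int handle,
			 int later_step_fails)
{
	int allocated = 0;
	int h = handle;

	if (handle == 0) {
		/* New handle even when replacing: the case raised above. */
		h = handle_alloc_range(t, prog, 1, MAX_HANDLE);
		if (h < 0)
			return h;
		allocated = 1;
	} else if (!oldprog) {
		h = handle_alloc_range(t, prog, handle, handle);
		if (h < 0)
			return h;
		allocated = 1;
	}

	if (later_step_fails) {
		if (allocated)
			handle_remove(t, h);
		return -EINVAL;
	}
	return h;
}
```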

^ permalink raw reply

* [PATCH net-next] net: ipv6: send NS for DAD when link operationally up
From: Mike Manning @ 2017-09-25 21:01 UTC (permalink / raw)
  To: netdev

The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.

Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/ipv6/addrconf.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78..dffbf3b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -303,10 +303,10 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.disable_policy		= 0,
 };
 
-/* Check if a valid qdisc is available */
-static inline bool addrconf_qdisc_ok(const struct net_device *dev)
+/* Check if link is ready: is it up and is a valid qdisc available */
+static inline bool addrconf_link_ready(const struct net_device *dev)
 {
-	return !qdisc_tx_is_noop(dev);
+	return netif_oper_up(dev) && !qdisc_tx_is_noop(dev);
 }
 
 static void addrconf_del_rs_timer(struct inet6_dev *idev)
@@ -451,7 +451,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 
 	ndev->token = in6addr_any;
 
-	if (netif_running(dev) && addrconf_qdisc_ok(dev))
+	if (netif_running(dev) && addrconf_link_ready(dev))
 		ndev->if_flags |= IF_READY;
 
 	ipv6_mc_init_dev(ndev);
@@ -3393,7 +3393,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 			/* restore routes for permanent addresses */
 			addrconf_permanent_addr(dev);
 
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is not ready yet. */
 				pr_info("ADDRCONF(NETDEV_UP): %s: link is not ready\n",
 					dev->name);
@@ -3408,7 +3408,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				run_pending = 1;
 			}
 		} else if (event == NETDEV_CHANGE) {
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is still not ready. */
 				break;
 			}
-- 
2.1.4
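
As an illustration of the new readiness condition, here is a user-space model
of the predicate. The struct and function names are hypothetical. Treating an
unknown operational state as up mirrors my understanding of netif_oper_up()
and should be read as an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the RFC 2863 operational states, enough for this sketch. */
enum oper_state { OPER_UNKNOWN, OPER_DOWN, OPER_UP };

struct link_sketch {
	bool admin_up;			/* interface is administratively up */
	enum oper_state operstate;	/* state of the lower device */
	bool qdisc_is_noop;		/* no usable qdisc attached yet */
};

static bool link_oper_up(const struct link_sketch *l)
{
	return l->operstate == OPER_UP || l->operstate == OPER_UNKNOWN;
}

/* Old check: qdisc only. New check: operational state and qdisc. */
static bool link_ready(const struct link_sketch *l)
{
	return link_oper_up(l) && !l->qdisc_is_noop;
}

/* DAD (and IPv6 device initialization) is deferred until this holds. */
static bool dad_may_start(const struct link_sketch *l)
{
	assert(l);
	return l->admin_up && link_ready(l);
}
```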

^ permalink raw reply related

* Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket
From: Thomas Gleixner @ 2017-09-25 20:42 UTC (permalink / raw)
  To: Vallish Vaidyeshwara
  Cc: Eric Dumazet, Eduardo Valentin, David Miller, dwmw2, shuah,
	richardcochran, xiyou.wangcong, netdev, linux-kernel, anchalag,
	dwmw
In-Reply-To: <20170920224816.GA73561@amazon.com>

On Wed, 20 Sep 2017, Vallish Vaidyeshwara wrote:
> On Sat, Sep 16, 2017 at 11:47:56AM +0200, Thomas Gleixner wrote:
> > > So if we need to replace all 'legacy' timers to high resolution timer,
> > > because some application was _relying_ on jiffies being kind of precise,
> > > maybe it is better to revert the change done on legacy timers.
> > 
> > Which would be a major step back in terms of timer performance and system
> > disturbance caused by massive recascading operations.
> > 
> > > Or continue the migration and make them use high res internally.
> > > 
> > > select() and poll() are the standard way to have precise timeouts,
> > > it is silly we have to maintain a timeout handling in the datagram fast
> > > path.
> > 
> > A few years ago we switched select/poll over to use hrtimers because the
> > wheel timers were too inaccurate for some operations, so it feels
> > consistent to switch the timeout in the datagram rcv path over as well. I
> > agree that the whole timeout magic there feels silly, but unfortunately
> > it's a documented property of sockets.
> > 
> 
> Thanks for your comments. This patch has been NACK'ed by David Miller. Is
> there any other approach to solve this problem without application code
> being recompiled?

We have only three options here:

   1) Do a massive revert of the timer wheel changes and lose all the
      benefits of that rework.

   2) Make that timer list -> hrtimer change in the datagram code

   3) Ignore it

#1 Would be pretty ironic as networking would take the biggest penalty of
   the revert.

#2 Is IMO the proper solution as it cures a user space visible regression,
   though the patch itself could be made way simpler

#3 Shrug

Dave, Eric?

Thanks,

	tglx

^ permalink raw reply

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
From: Paolo Abeni @ 2017-09-25 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactors the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform these tasks, a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follows:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then:
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute each
ingress packet to the reuseport socket whose index matches the id of the
ingress CPU, so we have a 1:1 mapping between the ingress receive queues
and the destination sockets:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020            3663            21
3               4352            5179            19
4               5318            6194            16
5               6258            7583            21
6               7376            8558            16

[ipv6]
2               2446            3949            61
3               3099            5092            64
4               3698            6611            78
5               4382            7852            79
6               5116            8851            73
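
For readers unfamiliar with the steering used for these runs, here is a
minimal sketch of a classic BPF program that returns the id of the CPU
processing the packet, attached to a reuseport group with
SO_ATTACH_REUSEPORT_CBPF. The function names build_cpu_steering_prog() and
attach_cpu_steering() are hypothetical, and this is not necessarily what
udp_sink --use_bpf does internally; it only illustrates the
"return socket[CPU#]" selection discussed in this thread.

```c
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, SKF_AD_* */
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51   /* asm-generic value; some arches differ */
#endif

/*
 * Hypothetical helper: fill a two-instruction classic BPF program.
 *   insn 0: A = id of the CPU currently processing the packet
 *   insn 1: return A (used as the index of the socket in the group)
 */
void build_cpu_steering_prog(struct sock_filter code[2],
                             struct sock_fprog *prog)
{
        code[0] = (struct sock_filter){ BPF_LD | BPF_W | BPF_ABS, 0, 0,
                                        SKF_AD_OFF + SKF_AD_CPU };
        code[1] = (struct sock_filter){ BPF_RET | BPF_A, 0, 0, 0 };
        prog->len = 2;
        prog->filter = code;
}

/*
 * Hypothetical helper: attach the program to one socket that already has
 * SO_REUSEPORT set. Returns the setsockopt() result (0 on success, -1 on
 * error with errno set).
 */
int attach_cpu_steering(int fd)
{
        struct sock_filter code[2];
        struct sock_fprog prog;

        build_cpu_steering_prog(code, &prog);
        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
}
```

As described in socket(7), the returned value indexes the sockets in the
order they joined the group, and an out-of-range index falls back to the
plain hash-based SO_REUSEPORT selection. The sketch therefore only gives
proper siloing when socket i is bound i-th and the packets for it are
processed on CPU i.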

Some notes:

- figures obtained with: 

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a significant bottleneck in
ip6_pol_route(): I see a lot of contention on a dst refcount, so
without early demux performance does not scale well there.

- For maximum performance the BH and the user space sink need to run on
different CPUs. Yes, we get some more cacheline misses and a little
contention on the receive queue spin lock, but far fewer icache misses
and more CPU cycles available, so the overall throughput is a lot higher
than when binding to the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

