Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 23:47 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <CAM_iQpUKURMfZOyTMbgVFG43rgUg8-3yAAVtEmPja5T6xkZpjA@mail.gmail.com>

On 09/26/2017 01:11 AM, Cong Wang wrote:
> On Mon, Sep 25, 2017 at 2:16 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 09/25/2017 07:13 PM, Cong Wang wrote:
[...]
>> There's this condition in the code before above idr allocations,
>> I think also in other classifiers:
>>
>>          if (oldprog) {
>>                  if (handle && oldprog->handle != handle) {
>>                          ret = -EINVAL;
>>                          goto errout;
>>                  }
>>          }
>
> Sure. If we use handle to find oldprog, it should have the
> same handle. cls_bpf_get() guarantees it. This check is
> redundant.

Good point, we should just test for 'oldprog && oldprog->handle !=
handle' and bail out then, otherwise it's just irritating. I can
see to fix this up later. Seems fine to me then, thanks!

^ permalink raw reply

* Re: [PATCH] [for 4.14] net: qcom/emac: specify the correct size when mapping a DMA buffer
From: David Miller @ 2017-09-25 23:35 UTC (permalink / raw)
  To: timur; +Cc: netdev, stable
In-Reply-To: <1506112364-26787-1-git-send-email-timur@codeaurora.org>

From: Timur Tabi <timur@codeaurora.org>
Date: Fri, 22 Sep 2017 15:32:44 -0500

> When mapping the RX DMA buffers, the driver was accidentally specifying
> zero for the buffer length.  Under normal circumstances, SWIOTLB does not
> need to allocate a bounce buffer, so the address is just mapped without
> checking the size field.  This is why the error was not detected earlier.
> 
> Fixes: b9b17debc69d ("net: emac: emac gigabit ethernet controller driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Timur Tabi <timur@codeaurora.org>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [net-next] net: remove MTU limits for dummy and ifb device
From: David Miller @ 2017-09-25 23:26 UTC (permalink / raw)
  To: zhangshengju; +Cc: sd, edumazet, netdev
In-Reply-To: <1506095869-5465-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Fri, 22 Sep 2017 23:57:49 +0800

> These two drivers (dummy and ifb) call ether_setup(), after commit
> 61e84623ace3 ("net: centralize net_device min/max MTU checking"), the
> range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
> 
> These two devices should not have limits on MTU. This patch set their
> min_mtu/max_mtu to 0. So that dev_set_mtu() will not check the mtu range,
> and can be set with any value.
> 
> CC: Eric Dumazet <edumazet@google.com>
> CC: Sabrina Dubroca <sd@queasysnail.net>
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied.

^ permalink raw reply

* Re: [PATCH] hv_netvsc: make const array ver_list static, reduces object code size
From: David Miller @ 2017-09-25 23:25 UTC (permalink / raw)
  To: colin.king
  Cc: kys, haiyangz, sthemmin, devel, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20170922155023.15821-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri, 22 Sep 2017 16:50:23 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Don't populate const array ver_list on the stack, instead make it
> static. Makes the object code smaller by over 400 bytes:
> 
> Before:
>    text	   data	    bss	    dec	    hex	filename
>   18444	   3168	    320	  21932	   55ac	drivers/net/hyperv/netvsc.o
> 
> After:
>    text	   data	    bss	    dec	    hex	filename
>   17950	   3224	    320	  21494	   53f6	drivers/net/hyperv/netvsc.o
> 
> (gcc 6.3.0, x86-64)
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob
From: David Miller @ 2017-09-25 23:24 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, edumazet, weiwan, lucab, netdev, linux-kernel
In-Reply-To: <1506088124-12650-2-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Fri, 22 Sep 2017 21:48:43 +0800

> @@ -9,13 +9,18 @@
>  #include <net/inetpeer.h>
>  #include <net/tcp.h>
>  
> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
> -
> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
> -
> -void tcp_fastopen_init_key_once(bool publish)
> +void tcp_fastopen_init_key_once(struct net *net)

Why did you remove the 'publish' logic from this function?

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Cong Wang @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925190031.GB1936@nanopsycho.orion>

On Mon, Sep 25, 2017 at 12:00 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Interesting, any idea why this is not 0x7FFFFFFF as well?
>
> I wonder if we could have 0x7FFFFFFF magic defined somewhere.

I have no idea, it just exists for a rather long time. Probably too late
to change, or at least requires a separate patch to change it.

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <59C97244.30301@iogearbox.net>

On Mon, Sep 25, 2017 at 2:16 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 09/25/2017 07:13 PM, Cong Wang wrote:
>>         ret = cls_bpf_offload(tp, prog, oldprog);
>>         if (ret) {
>> +               if (!oldprog)
>> +                       idr_remove_ext(&head->handle_idr, prog->handle);
>
>
> Shouldn't we also call idr_remove_ext() when there was an
> oldprog, but we didn't care about reusing the same handle,
> so it was handle == 0 initially?

When oldprog is non-NULL, we are replacing the oldprog with
a new one, therefore we should call idr_replace_ext() which
happens after this. So no need to call idr_remove_ext() at
this point.



>
> There's this condition in the code before above idr allocations,
> I think also in other classifiers:
>
>         if (oldprog) {
>                 if (handle && oldprog->handle != handle) {
>                         ret = -EINVAL;
>                         goto errout;
>                 }
>         }

Sure. If we use handle to find oldprog, it should have the
same handle. cls_bpf_get() guarantees it. This check is
redundant.

>
>>                 __cls_bpf_delete_prog(prog);
>>                 return ret;
>>         }
>> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct
>> sk_buff *in_skb,
>>                 prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>>
>>         if (oldprog) {
>> +               idr_replace_ext(&head->handle_idr, prog, handle);
>
>
> And here, we should probably use prog->handle for the above
> mentioned case as well, no?

Since are replacing oldprog with a new one, prog->handle is
same with handle.


>
> Would be great if all this (and e.g. the fact that we use idr itself)
> could optionally be hidden behind some handle generator api given
> we could reuse that api also for cls_basic and cls_u32. Could also
> be followed-up perhaps.
>

Yeah, the idr_alloc_ext(.., handle, handle+1,) is ugly. Ideally we should
specify the range during initialization rather than in each idr_alloc_ext().
Commit c15ab236d69d already did the same thing. We can refactor
this later.

^ permalink raw reply

* [PATCH net] net: dsa: Fix network device registration order
From: Florian Fainelli @ 2017-09-25 22:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli, Andrew Lunn, Vivien Didelot, open list

We cannot be registering the network device first, then setting its
carrier off and finally connecting it to a PHY, doing that leaves a
window during which the carrier is at best inconsistent, and at worse
the device is not usable without a down/up sequence since the network
device is visible to user space with possibly no PHY device attached.

Re-order steps so that they make logical sense. This fixes some devices
where the port was not usable after e.g: an unbind then bind of the
driver.

Fixes: 0071f56e46da ("dsa: Register netdev before phy")
Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/slave.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2afa99506f8b..865e29e62bad 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1301,28 +1301,33 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
 	p->old_duplex = -1;
 
 	port->netdev = slave_dev;
-	ret = register_netdev(slave_dev);
-	if (ret) {
-		netdev_err(master, "error %d registering interface %s\n",
-			   ret, slave_dev->name);
-		port->netdev = NULL;
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
-	}
 
 	netif_carrier_off(slave_dev);
 
 	ret = dsa_slave_phy_setup(p, slave_dev);
 	if (ret) {
 		netdev_err(master, "error %d setting up slave phy\n", ret);
-		unregister_netdev(slave_dev);
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
+		goto out_free;
+	}
+
+	ret = register_netdev(slave_dev);
+	if (ret) {
+		netdev_err(master, "error %d registering interface %s\n",
+			   ret, slave_dev->name);
+		goto out_phy;
 	}
 
 	return 0;
+
+out_phy:
+	phy_disconnect(p->phy);
+	if (of_phy_is_fixed_link(p->dp->dn))
+		of_phy_deregister_fixed_link(p->dp->dn);
+out_free:
+	free_percpu(p->stats64);
+	free_netdev(slave_dev);
+	port->netdev = NULL;
+	return ret;
 }
 
 void dsa_slave_destroy(struct net_device *slave_dev)
-- 
2.9.3

^ permalink raw reply related

* [jkirsher/next-queue PATCH] ixgbe: Update adaptive ITR algorithm
From: Alexander Duyck @ 2017-09-25 21:55 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: john.fastabend, brouer

From: Alexander Duyck <alexander.h.duyck@intel.com>

The following change is meant to update the adaptive ITR algorithm to
better support the needs of the network. Specifically with this change what
I have done is make it so that our ITR algorithm will try to prevent either
starving a socket buffer for memory in the case of Tx, or overruing an Rx
socket buffer on receive.

In addition a side effect of the calculations used is that we should
function better with new features such as XDP which can handle small
packets at high rates without needing to lock us into NAPI polling mode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

So I am putting this out to a wider distribution list than normal for a
patch like this in order to get feedback on if there are any areas I may
have overlooked. With this patch is should address many of the performance
limitations seen with pktgen and XDP in terms of workloads that the old
adaptive scheme wasn't handling.

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    7 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |   11 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  215 +++++++++++++++++++------
 3 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 56039d04b38d..555eb80d8a08 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -435,8 +435,15 @@ static inline unsigned int ixgbe_rx_pg_order(struct ixgbe_ring *ring)
 }
 #define ixgbe_rx_pg_size(_ring) (PAGE_SIZE << ixgbe_rx_pg_order(_ring))
 
+#define IXGBE_ITR_ADAPTIVE_MIN_INC	2
+#define IXGBE_ITR_ADAPTIVE_MIN_USECS	10
+#define IXGBE_ITR_ADAPTIVE_MAX_USECS	126
+#define IXGBE_ITR_ADAPTIVE_LATENCY	0x80
+#define IXGBE_ITR_ADAPTIVE_BULK		0x00
+
 struct ixgbe_ring_container {
 	struct ixgbe_ring *ring;	/* pointer to linked list of rings */
+	unsigned long next_update;	/* jiffies value of last update */
 	unsigned int total_bytes;	/* total bytes processed this int */
 	unsigned int total_packets;	/* total packets processed this int */
 	u16 work_limit;			/* total work allowed per interrupt */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f1bfae0c41d0..8e2a957aca18 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -806,6 +806,7 @@ static void ixgbe_add_ring(struct ixgbe_ring *ring,
 	ring->next = head->ring;
 	head->ring = ring;
 	head->count++;
+	head->next_update = jiffies + 1;
 }
 
 /**
@@ -879,8 +880,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* initialize work limits */
 	q_vector->tx.work_limit = adapter->tx_work_limit;
 
-	/* initialize pointer to rings */
-	ring = q_vector->ring;
+	/* Initialize setting for adaptive ITR */
+	q_vector->tx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	q_vector->rx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
 
 	/* intialize ITR */
 	if (txr_count && !rxr_count) {
@@ -897,6 +901,9 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 			q_vector->itr = adapter->rx_itr_setting;
 	}
 
+	/* initialize pointer to rings */
+	ring = q_vector->ring;
+
 	while (txr_count) {
 		/* assign generic ring traits */
 		ring->dev = &adapter->pdev->dev;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3d3739f103af..44a96878075b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2517,50 +2517,174 @@ enum latency_range {
 static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
 			     struct ixgbe_ring_container *ring_container)
 {
-	int bytes = ring_container->total_bytes;
-	int packets = ring_container->total_packets;
-	u32 timepassed_us;
-	u64 bytes_perint;
-	u8 itr_setting = ring_container->itr;
+	unsigned int itr = IXGBE_ITR_ADAPTIVE_MIN_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	unsigned int avg_wire_size, packets, bytes;
+	unsigned long next_update = jiffies;
 
-	if (packets == 0)
+	/* If we don't have any rings just leave ourselves set for maximum
+	 * possible latency so we take ourselves out of the equation.
+	 */
+	if (!ring_container->ring)
 		return;
 
-	/* simple throttlerate management
-	 *   0-10MB/s   lowest (100000 ints/s)
-	 *  10-20MB/s   low    (20000 ints/s)
-	 *  20-1249MB/s bulk   (12000 ints/s)
+	/* If we didn't update within up to 1 - 2 jiffies we can assume
+	 * that either packets are coming in so slow there hasn't been
+	 * any work, or that there is so much work that NAPI is dealing
+	 * with interrupt moderation and we don't need to do anything.
 	 */
-	/* what was last interrupt timeslice? */
-	timepassed_us = q_vector->itr >> 2;
-	if (timepassed_us == 0)
-		return;
+	if (time_after(next_update, ring_container->next_update))
+		goto clear_counts;
 
-	bytes_perint = bytes / timepassed_us; /* bytes/usec */
+	packets = ring_container->total_packets;
 
-	switch (itr_setting) {
-	case lowest_latency:
-		if (bytes_perint > 10)
-			itr_setting = low_latency;
-		break;
-	case low_latency:
-		if (bytes_perint > 20)
-			itr_setting = bulk_latency;
-		else if (bytes_perint <= 10)
-			itr_setting = lowest_latency;
+	/* We have no packets to actually measure against. This means
+	 * either one of the other queues on this vector is active or
+	 * we are a Tx queue doing TSO with too high of an interrupt rate.
+	 *
+	 * When this occurs just tick up our delay by the minimum value
+	 * and hope that this extra delay will prevent us from being called
+	 * without any work on our queue.
+	 */
+	if (!packets) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		itr += ring_container->itr & IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto clear_counts;
+	}
+
+	bytes = ring_container->total_bytes;
+
+	/* If packets are less than 4 or bytes are less than 9000 assume
+	 * insufficient data to use bulk rate limiting approach. We are
+	 * likely latency driven.
+	 */
+	if (packets < 4 && bytes < 9000) {
+		itr = IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto adjust_by_size;
+	}
+
+	/* Between 4 and 48 we can assume that our current interrupt delay
+	 * is only slightly too low. As such we should increase it by a small
+	 * fixed amount.
+	 */
+	if (packets < 48) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		goto clear_counts;
+	}
+
+	/* Between 48 and 96 is our "goldilocks" zone where we are working
+	 * out "just right". Just report that our current ITR is good for us.
+	 */
+	if (packets < 96) {
+		itr = q_vector->itr >> 2;
+		goto clear_counts;
+	}
+
+	/* If packet count is 96 or greater we are likely looking at a slight
+	 * overrun of the delay we want. Try halving our delay to see if that
+	 * will cut the number of packets in half per interrupt.
+	 */
+	if (packets < 256) {
+		itr = q_vector->itr >> 3;
+		if (itr < IXGBE_ITR_ADAPTIVE_MIN_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MIN_USECS;
+		goto clear_counts;
+	}
+
+	/* The paths below assume we are dealing with a bulk ITR since number
+	 * of packets is 256 or greater. We are just going to have to compute
+	 * a value and try to bring the count under control, though for smaller
+	 * packet sizes there isn't much we can do as NAPI polling will likely
+	 * be kicking in sooner rather than later.
+	 */
+	itr = IXGBE_ITR_ADAPTIVE_BULK;
+
+adjust_by_size:
+	/* If packet counts are 256 or greater we can assume we have a gross
+	 * overestimation of what the rate should be. Instead of trying to fine
+	 * tune it just use the formula below to try and dial in an exact value
+	 * give the current packet size of the frame.
+	 */
+	avg_wire_size = bytes / packets;
+
+	/* The following is a crude approximation of:
+	 *  wmem_default / (size + overhead) = desired_pkts_per_int
+	 *  rate / bits_per_byte / (size + ethernet overhead) = pkt_rate
+	 *  (desired_pkt_rate / pkt_rate) * usecs_per_sec = ITR value
+	 *
+	 * Assuming wmem_default is 212992 and overhead is 640 bytes per
+	 * packet, (256 skb, 64 headroom, 320 shared info), we can reduce the
+	 * formula down to
+	 *
+	 *  (170 * (size + 24)) / (size + 640) = ITR
+	 *
+	 * We first do some math on the packet size and then finally bitshift
+	 * by 8 after rounding up. We also have to account for PCIe link speed
+	 * difference as ITR scales based on this.
+	 */
+	if (avg_wire_size <= 60) {
+		/* Start at 50k ints/sec */
+		avg_wire_size = 5120;
+	} else if (avg_wire_size <= 316) {
+		/* 50K ints/sec to 16K ints/sec */
+		avg_wire_size *= 40;
+		avg_wire_size += 2720;
+	} else if (avg_wire_size <= 1084) {
+		/* 16K ints/sec to 9.2K ints/sec */
+		avg_wire_size *= 15;
+		avg_wire_size += 11452;
+	} else if (avg_wire_size <= 1980) {
+		/* 9.2K ints/sec to 8K ints/sec */
+		avg_wire_size *= 5;
+		avg_wire_size += 22420;
+	} else {
+		/* plateau at a limit of 8K ints/sec */
+		avg_wire_size = 32256;
+	}
+
+	/* If we are in low latency mode half our delay which doubles the rate
+	 * to somewhere between 100K to 16K ints/sec
+	 */
+	if (itr & IXGBE_ITR_ADAPTIVE_LATENCY)
+		avg_wire_size >>= 1;
+
+	/* Resultant value is 256 times larger than it needs to be. This
+	 * gives us room to adjust the value as needed to either increase
+	 * or decrease the value based on link speeds of 10G, 2.5G, 1G, etc.
+	 *
+	 * Use addition as we have already recorded the new latency flag
+	 * for the ITR value.
+	 */
+	switch (q_vector->adapter->link_speed) {
+	case IXGBE_LINK_SPEED_10GB_FULL:
+	case IXGBE_LINK_SPEED_100_FULL:
+	default:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 256) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
-	case bulk_latency:
-		if (bytes_perint <= 20)
-			itr_setting = low_latency;
+	case IXGBE_LINK_SPEED_2_5GB_FULL:
+	case IXGBE_LINK_SPEED_1GB_FULL:
+	case IXGBE_LINK_SPEED_10_FULL:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 64) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
 	}
 
-	/* clear work counters since we have the values we need */
+clear_counts:
+	/* write back value */
+	ring_container->itr = itr;
+
+	/* next update should occur within next jiffy */
+	ring_container->next_update = next_update + 1;
+
 	ring_container->total_bytes = 0;
 	ring_container->total_packets = 0;
-
-	/* write updated itr to ring container */
-	ring_container->itr = itr_setting;
 }
 
 /**
@@ -2602,34 +2726,19 @@ void ixgbe_write_eitr(struct ixgbe_q_vector *q_vector)
 
 static void ixgbe_set_itr(struct ixgbe_q_vector *q_vector)
 {
-	u32 new_itr = q_vector->itr;
-	u8 current_itr;
+	u32 new_itr;
 
 	ixgbe_update_itr(q_vector, &q_vector->tx);
 	ixgbe_update_itr(q_vector, &q_vector->rx);
 
-	current_itr = max(q_vector->rx.itr, q_vector->tx.itr);
+	/* use the smallest value of new ITR delay calculations */
+	new_itr = min(q_vector->rx.itr, q_vector->tx.itr);
 
-	switch (current_itr) {
-	/* counts and packets in update_itr are dependent on these numbers */
-	case lowest_latency:
-		new_itr = IXGBE_100K_ITR;
-		break;
-	case low_latency:
-		new_itr = IXGBE_20K_ITR;
-		break;
-	case bulk_latency:
-		new_itr = IXGBE_12K_ITR;
-		break;
-	default:
-		break;
-	}
+	/* Clear latency flag if set, shift into correct position */
+	new_itr &= ~IXGBE_ITR_ADAPTIVE_LATENCY;
+	new_itr <<= 2;
 
 	if (new_itr != q_vector->itr) {
-		/* do an exponential smoothing */
-		new_itr = (10 * new_itr * q_vector->itr) /
-			  ((9 * new_itr) + q_vector->itr);
-
 		/* save the algorithm value here */
 		q_vector->itr = new_itr;
 

^ permalink raw reply related

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Ahern @ 2017-09-25 21:52 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <20170925.131101.478664536826772174.davem@davemloft.net>

On 9/25/17 2:11 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 25 Sep 2017 10:14:23 -0600
> 
>> I made a simple program this morning and ran it under perf.
> 
> If possible please submit this for selftests.
> 

It is more of a microbenchmark of options to flush an rbtree than a
self-test. Further, it relies on the tools/lib/rbtree.c versus
lib/rbtree.c. The tools/lib version was imported by Arnaldo in July 2015
and is a out of date, though it is good enough to show the intent w.r.t.
flushing options.

^ permalink raw reply

* Re: [PATCH net 0/2] l2tp: fix some races in session deletion
From: David Miller @ 2017-09-25 21:45 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin, sd
In-Reply-To: <cover.1506086081.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 22 Sep 2017 15:39:22 +0200

> L2TP provides several interfaces for deleting sessions. Using two of
> them concurrently can lead to use-after-free bugs.
> 
> Patch #2 uses a flag to prevent double removal of L2TP sessions.
> Patch #1 fixes a bug found in the way. Fixing this bug is also
> necessary for patch #2 to handle all cases.
> 
> This issue is similar to the tunnel deletion bug being worked on by
> Sabrina: https://patchwork.ozlabs.org/patch/814173/

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bpf/verifier: improve disassembly of BPF_END instructions
From: Daniel Borkmann @ 2017-09-25 21:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Y Song
  Cc: Edward Cree, David Miller, netdev, Jiong Wang, Jakub Kicinski
In-Reply-To: <20170924055016.w6x5tj6kjxjbocpl@ast-mbp>

On 09/24/2017 07:50 AM, Alexei Starovoitov wrote:
> On Fri, Sep 22, 2017 at 09:49:10PM -0700, Y Song wrote:
>> On Fri, Sep 22, 2017 at 9:23 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 22/09/17 16:16, Alexei Starovoitov wrote:
>>>> looks like we're converging on
>>>> "be16/be32/be64/le16/le32/le64 #register" for BPF_END.
>>>> I guess it can live with that. I would prefer more C like syntax
>>>> to match the rest, but llvm parsing point is a strong one.
>>> Yep, agreed.  I'll post a v2 once we've settled BPF_NEG.
>>>> For BPG_NEG I prefer to do it in C syntax like interpreter does:
>>>>          ALU_NEG:
>>>>                  DST = (u32) -DST;
>>>>          ALU64_NEG:
>>>>                  DST = -DST;
>>>> Yonghong, does it mean that asmparser will equally suffer?
>>> Correction to my earlier statements: verifier will currently disassemble
>>>   neg as:
>>> (87) r0 neg 0
>>> (84) (u32) r0 neg (u32) 0
>>>   because it pretends 'neg' is a compound-assignment operator like +=.
>>> The analogy with be16 and friends would be to use
>>>      neg64 r0
>>>      neg32 r0
>>>   whereas the analogy with everything else would be
>>>      r0 = -r0
>>>      r0 = (u32) -r0
>>>   as Alexei says.
>>> I'm happy to go with Alexei's version if it doesn't cause problems for llvm.
>>
>> I got some time to do some prototyping in llvm and it looks like that
>> I am able to
>> resolve the issue and we are able to use more C-like syntax. That is:
>> for bswap:
>>       r1 = (be16) (u16) r1
>>       or
>>       r1 = (be16) r1
>>       or
>>       r1 = be16 r1
>> for neg:
>>       r0 = -r0
>>       (for 32bit support, llvm may output "w0 = -w0" in the future. But
>> since it is not
>>        enabled yet, you can continue to output "r0 = (u32) -r0".)
>>
>> Not sure which syntax is best for bswap. The "r1 = (be16) (u16) r1" is most
>> explicit in its intention.
>>
>> Attaching my llvm patch as well and cc'ing Jiong and Jakub so they can see my
>> implementation and the relative discussion here. (In this patch, I did
>> not implement
>> bswap for little endian yet.) Maybe they can provide additional comments.
>
> This is awesome. In such case I'd like to swing back to the C syntax for bpf_end :)
> Any of these
>    r1 = (be16) (u16) r1
>    or
>    r1 = (be16) r1
>    or
>    r1 = be16 r1
> are better than just
>    be16 r1
> I like 1st the most, since it's explicit in terms of what happens with upper bits,
> but 2nd is also ok. 3rd is not quite C-like.

But above cast to be16 also doesn't seem quite C-like in terms
of what we're actually doing... 3rd option would be my personal
preference even if it doesn't look C-like, but otoh we also have
'call' etc which is neither.

^ permalink raw reply

* Re: [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Vivien Didelot @ 2017-09-25 21:34 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506375140-2853-1-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> Ports with the same VLAN must all be in the same bridge. However the
> CPU and DSA ports need to be in multiple VLANs spread over multiple
> bridges. So exclude them when performing this test.
>
> Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

^ permalink raw reply

* Re: [PATCH net-next v2] bpf: Optimize lpm trie delete
From: David Miller @ 2017-09-25 21:38 UTC (permalink / raw)
  To: kraigatgoog; +Cc: daniel, ast, daniel, netdev
In-Reply-To: <20170921224329.101928-1-kraigatgoog@gmail.com>

From: Craig Gallek <kraigatgoog@gmail.com>
Date: Thu, 21 Sep 2017 18:43:29 -0400

> From: Craig Gallek <kraig@google.com>
> 
> Before the delete operator was added, this datastructure maintained
> an invariant that intermediate nodes were only present when necessary
> to build the tree.  This patch updates the delete operation to reinstate
> that invariant by removing unnecessary intermediate nodes after a node is
> removed and thus keeping the tree structure at a minimal size.
> 
> Suggested-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Craig Gallek <kraig@google.com>

Applied, thank you.

^ permalink raw reply

* [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Andrew Lunn @ 2017-09-25 21:32 UTC (permalink / raw)
  To: David Miller; +Cc: Vivien Didelot, netdev, Andrew Lunn

Ports with the same VLAN must all be in the same bridge. However the
CPU and DSA ports need to be in multiple VLANs spread over multiple
bridges. So exclude them when performing this test.

Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c6678aa9b4ef..674dab71d71c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1100,6 +1100,10 @@ static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch *ds, int port,
 	};
 	int i, err;
 
+	/* DSA and CPU ports have to be members of multiple vlans */
+	if (dsa_is_dsa_port(ds, port) || dsa_is_cpu_port(ds, port))
+		return 0;
+
 	if (!vid_begin)
 		return -EOPNOTSUPP;
 
-- 
2.14.1

^ permalink raw reply related

* WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
From: Richard Weinberger @ 2017-09-25 21:23 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linux-kernel

Hi!

While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to trigger 
this splat:

[  297.629773] WARNING: kernel stack frame pointer at ffff880156a5fea0 in 
bash:2103 has bad value 00007ffec7d87e50
[  297.629777] unwind stack type:0 next_sp:          (null) mask:0x6 
graph_idx:0
[  297.629783] ffff88015b207ae0: ffff88015b207b68 (0xffff88015b207b68)
[  297.629790] ffff88015b207ae8: ffffffffb163c00e (__save_stack_trace+0x6e/
0xd0)
[  297.629792] ffff88015b207af0: 0000000000000000 ...
[  297.629795] ffff88015b207af8: ffff880156a58000 (0xffff880156a58000)
[  297.629799] ffff88015b207b00: ffff880156a60000 (0xffff880156a60000)
[  297.629800] ffff88015b207b08: 0000000000000000 ...
[  297.629803] ffff88015b207b10: 0000000000000006 (0x6)
[  297.629806] ffff88015b207b18: ffff880151b02700 (0xffff880151b02700)
[  297.629809] ffff88015b207b20: 0000010100000000 (0x10100000000)
[  297.629812] ffff88015b207b28: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.629815] ffff88015b207b30: ffff88015b207ae0 (0xffff88015b207ae0)
[  297.629818] ffff88015b207b38: ffffffffc0050282 (0xffffffffc0050282)
[  297.629819] ffff88015b207b40: 0000000000000000 ...
[  297.629822] ffff88015b207b48: 0000000001000000 (0x1000000)
[  297.629825] ffff88015b207b50: ffff880157b98280 (0xffff880157b98280)
[  297.629828] ffff88015b207b58: ffff880157b98380 (0xffff880157b98380)
[  297.629831] ffff88015b207b60: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.629834] ffff88015b207b68: ffff88015b207b78 (0xffff88015b207b78)
[  297.629838] ffff88015b207b70: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629841] ffff88015b207b78: ffff88015b207da8 (0xffff88015b207da8)
[  297.629847] ffff88015b207b80: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629850] ffff88015b207b88: 000000400000000c (0x400000000c)
[  297.629852] ffff88015b207b90: ffff88015b207ba0 (0xffff88015b207ba0)
[  297.629855] ffff88015b207b98: ffff880100000000 (0xffff880100000000)
[  297.629859] ffff88015b207ba0: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629864] ffff88015b207ba8: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629868] ffff88015b207bb0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.629873] ffff88015b207bb8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.629879] ffff88015b207bc0: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.629886] ffff88015b207bc8: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.629892] ffff88015b207bd0: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.629897] ffff88015b207bd8: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.629902] ffff88015b207be0: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.629907] ffff88015b207be8: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.629913] ffff88015b207bf0: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.629916] ffff88015b207bf8: ffffffffc0050282 (0xffffffffc0050282)
[  297.629918] ffff88015b207c00: 0000000000000000 ...
[  297.629921] ffff88015b207c08: ffff88015a77e24c (0xffff88015a77e24c)
[  297.629924] ffff88015b207c10: ffff88015b207c38 (0xffff88015b207c38)
[  297.629927] ffff88015b207c18: ffff88015b207c38 (0xffff88015b207c38)
[  297.629929] ffff88015b207c20: 0000000000000086 (0x86)
[  297.629932] ffff88015b207c28: ffff88015a77db00 (0xffff88015a77db00)
[  297.629935] ffff88015b207c30: 1ffff1002b640f91 (0x1ffff1002b640f91)
[  297.629938] ffff88015b207c38: ffff88015b207d10 (0xffff88015b207d10)
[  297.629945] ffff88015b207c40: ffffffffb16c9f60 (try_to_wake_up+0xb0/0x710)
[  297.629947] ffff88015b207c48: 0000000000000000 ...
[  297.629952] ffff88015b207c50: ffffffffb2dfd3c0 (machine_ops+0x40/0x40)
[  297.629954] ffff88015b207c58: ffff88015a77df94 (0xffff88015a77df94)
[  297.629957] ffff88015b207c60: 0000000000023540 (0x23540)
[  297.629960] ffff88015b207c68: ffff88015b215c38 (0xffff88015b215c38)
[  297.629963] ffff88015b207c70: ffff88015b200000 (0xffff88015b200000)
[  297.629965] ffff88015b207c78: 0000000000000086 (0x86)
[  297.629968] ffff88015b207c80: 0000000100000000 (0x100000000)
[  297.629971] ffff88015b207c88: 0000000041b58ab3 (0x41b58ab3)
[  297.629975] ffff88015b207c90: ffffffffb2d919f2 (.LC2+0x6e0e/0x83b5)
[  297.629981] ffff88015b207c98: ffffffffb16c9eb0 (migrate_swap_stop
+0x2e0/0x2e0)
[  297.629986] ffff88015b207ca0: ffffffffb16d0f73 (account_entity_dequeue
+0x73/0x110)
[  297.629989] ffff88015b207ca8: 0000000000100000 (0x100000)
[  297.629992] ffff88015b207cb0: ffff88015b2235a0 (0xffff88015b2235a0)
[  297.629994] ffff88015b207cb8: ffff88015061e280 (0xffff88015061e280)
[  297.629997] ffff88015b207cc0: ffff88015b207ce8 (0xffff88015b207ce8)
[  297.630003] ffff88015b207cc8: ffffffffb16c87ed (sched_avg_update+0x2d/0x90)
[  297.630005] ffff88015b207cd0: 0000000000000005 (0x5)
[  297.630008] ffff88015b207cd8: ffff88015b223570 (0xffff88015b223570)
[  297.630010] ffff88015b207ce0: 00000000000000dd (0xdd)
[  297.630013] ffff88015b207ce8: ffff88015a017ea0 (0xffff88015a017ea0)
[  297.630021] ffff88015b207cf0: ffffffffb30b7128 (rcu_sched_state
+0x928/0xaa0)
[  297.630024] ffff88015b207cf8: ffff880151b02700 (0xffff880151b02700)
[  297.630026] ffff88015b207d00: 0000000000000001 (0x1)
[  297.630031] ffff88015b207d08: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630034] ffff88015b207d10: ffff88015b207d20 (0xffff88015b207d20)
[  297.630040] ffff88015b207d18: ffffffffb16ca5d0 (wake_up_process+0x10/0x20)
[  297.630043] ffff88015b207d20: ffff88015b207d48 (0xffff88015b207d48)
[  297.630045] ffff88015b207d28: ffff88015b207d48 (0xffff88015b207d48)
[  297.630048] ffff88015b207d30: 0000000000000202 (0x202)
[  297.630053] ffff88015b207d38: ffffffffb30b7120 (rcu_sched_state
+0x920/0xaa0)
[  297.630056] ffff88015b207d40: 0000000000000202 (0x202)
[  297.630059] ffff88015b207d48: ffff88015b207d68 (0xffff88015b207d68)
[  297.630063] ffff88015b207d50: ffffffffb16ee225 (swake_up+0x25/0x30)
[  297.630069] ffff88015b207d58: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630072] ffff88015b207d60: ffff88015a77db00 (0xffff88015a77db00)
[  297.630074] ffff88015b207d68: ffff88015b207d90 (0xffff88015b207d90)
[  297.630079] ffff88015b207d70: ffffffffb1720016 (rcu_gp_kthread_wake
+0x56/0x60)
[  297.630082] ffff88015b207d78: 0000000000000002 (0x2)
[  297.630087] ffff88015b207d80: ffffffffb30b7138 (rcu_sched_state
+0x938/0xaa0)
[  297.630092] ffff88015b207d88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630095] ffff88015b207d90: ffff88015b207e18 (0xffff88015b207e18)
[  297.630099] ffff88015b207d98: ffffffffb1720521 (rcu_report_qs_rnp
+0x2f1/0x310)
[  297.630102] ffff88015b207da0: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.630105] ffff88015b207da8: ffff88015b207dd0 (0xffff88015b207dd0)
[  297.630110] ffff88015b207db0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.630113] ffff88015b207db8: ffff880157b98280 (0xffff880157b98280)
[  297.630116] ffff88015b207dc0: ffffea00055ee600 (0xffffea00055ee600)
[  297.630121] ffff88015b207dc8: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630124] ffff88015b207dd0: ffff88015b207e00 (0xffff88015b207e00)
[  297.630128] ffff88015b207dd8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.630131] ffff88015b207de0: ffff880157b98280 (0xffff880157b98280)
[  297.630135] ffff88015b207de8: ffffffffb18b7e60 (get_max_files+0x10/0x10)
[  297.630141] ffff88015b207df0: ffffffffb30b72a0 (rcu_sched_state
+0xaa0/0xaa0)
[  297.630143] ffff88015b207df8: 000000000000000f (0xf)
[  297.630146] ffff88015b207e00: ffff88015b207e18 (0xffff88015b207e18)
[  297.630150] ffff88015b207e08: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630153] ffff88015b207e10: ffff880157b98280 (0xffff880157b98280)
[  297.630156] ffff88015b207e18: ffff88015b207f30 (0xffff88015b207f30)
[  297.630161] ffff88015b207e20: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.630164] ffff88015b207e28: ffff88015b21b000 (0xffff88015b21b000)
[  297.630167] ffff88015b207e30: ffff88015b21b070 (0xffff88015b21b070)
[  297.630170] ffff88015b207e38: 1ffff1002b640fd5 (0x1ffff1002b640fd5)
[  297.630173] ffff88015b207e40: ffff880151b02700 (0xffff880151b02700)
[  297.630176] ffff88015b207e48: ffff88015b224200 (0xffff88015b224200)
[  297.630178] ffff88015b207e50: ffff88015b224280 (0xffff88015b224280)
[  297.630181] ffff88015b207e58: ffff88015b2242b0 (0xffff88015b2242b0)
[  297.630184] ffff88015b207e60: ffff88015b207f08 (0xffff88015b207f08)
[  297.630187] ffff88015b207e68: ffff880151b0274c (0xffff880151b0274c)
[  297.630190] ffff88015b207e70: ffff880151b02700 (0xffff880151b02700)
[  297.630195] ffff88015b207e78: ffffffffb30b7258 (rcu_sched_state
+0xa58/0xaa0)
[  297.630198] ffff88015b207e80: ffff880157b98288 (0xffff880157b98288)
[  297.630203] ffff88015b207e88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630206] ffff88015b207e90: ffff88015b224238 (0xffff88015b224238)
[  297.630209] ffff88015b207e98: ffff88015b207ec8 (0xffff88015b207ec8)
[  297.630211] ffff88015b207ea0: 000000000000000a (0xa)
[  297.630214] ffff88015b207ea8: 0000000041b58ab3 (0x41b58ab3)
[  297.630218] ffff88015b207eb0: ffffffffb2d944f5 (.LC0+0x155c/0xa3a6)
[  297.630223] ffff88015b207eb8: ffffffffb1725530 (note_gp_changes+0xe0/0xe0)
[  297.630226] ffff88015b207ec0: ffff88015b215740 (0xffff88015b215740)
[  297.630229] ffff88015b207ec8: ffff880157b983c0 (0xffff880157b983c0)
[  297.630231] ffff88015b207ed0: ffff88014ac19eb0 (0xffff88014ac19eb0)
[  297.630234] ffff88015b207ed8: ffffffffffffffff (0xffffffffffffffff)
[  297.630236] ffff88015b207ee0: 0000000000000000 ...
[  297.630239] ffff88015b207ee8: 0000004552dda1c0 (0x4552dda1c0)
[  297.630240] ffff88015b207ef0: 0000000000000000 ...
[  297.630243] ffff88015b207ef8: ffff88015b207f20 (0xffff88015b207f20)
[  297.630249] ffff88015b207f00: ffffffffb174a0a8 (tick_program_event
+0x48/0x80)
[  297.630252] ffff88015b207f08: 0000000000000009 (0x9)
[  297.630259] ffff88015b207f10: ffffffffb3009148 (softirq_vec+0x48/0x80)
[  297.630261] ffff88015b207f18: 0000000000000009 (0x9)
[  297.630263] ffff88015b207f20: 0000000000000008 (0x8)
[  297.630265] ffff88015b207f28: 0000000000000009 (0x9)
[  297.630268] ffff88015b207f30: ffff88015b207fa8 (0xffff88015b207fa8)
[  297.630273] ffff88015b207f38: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.630276] ffff88015b207f40: 0000000a00404100 (0xa00404100)
[  297.630279] ffff88015b207f48: ffff880151b02700 (0xffff880151b02700)
[  297.630282] ffff88015b207f50: 00000000fffff730 (0xfffff730)
[  297.630284] ffff88015b207f58: 0000000000000009 (0x9)
[  297.630286] ffff88015b207f60: 0000000000000040 (0x40)
[  297.630289] ffff88015b207f68: 000001005b21c294 (0x1005b21c294)
[  297.630294] ffff88015b207f70: ffffffffb3009110 (softirq_vec+0x10/0x80)
[  297.630297] ffff88015b207f78: 0000008000000008 (0x8000000008)
[  297.630300] ffff88015b207f80: ffff88015a77ce00 (0xffff88015a77ce00)
[  297.630303] ffff88015b207f88: ffff88015b215840 (0xffff88015b215840)
[  297.630304] ffff88015b207f90: 0000000000000000 ...
[  297.630307] ffff88015b207f98: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630311] ffff88015b207fa0: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630314] ffff88015b207fa8: ffff88015b207fc0 (0xffff88015b207fc0)
[  297.630318] ffff88015b207fb0: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.630321] ffff88015b207fb8: ffff88015b215740 (0xffff88015b215740)
[  297.630324] ffff88015b207fc0: ffff88015b207fe8 (0xffff88015b207fe8)
[  297.630329] ffff88015b207fc8: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.630332] ffff88015b207fd0: ffff88015104d500 (0xffff88015104d500)
[  297.630335] ffff88015b207fd8: ffff88015b215840 (0xffff88015b215840)
[  297.630338] ffff88015b207fe0: 0000000000000246 (0x246)
[  297.630341] ffff88015b207fe8: ffff880156a5fdc9 (0xffff880156a5fdc9)
[  297.630345] ffff88015b207ff0: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.630348] ffff88015b207ff8: ffff880156a5fdc8 (0xffff880156a5fdc8)
[  297.630352] ffff880156a5fdc8: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630355] ffff880156a5fdd0: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630357] ffff880156a5fdd8: 0000000000000246 (0x246)
[  297.630360] ffff880156a5fde0: ffff88015b215840 (0xffff88015b215840)
[  297.630363] ffff880156a5fde8: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.630366] ffff880156a5fdf0: ffff88015104d500 (0xffff88015104d500)
[  297.630369] ffff880156a5fdf8: fffff52000140c08 (0xfffff52000140c08)
[  297.630372] ffff880156a5fe00: ffffc90000a0603f (0xffffc90000a0603f)
[  297.630375] ffff880156a5fe08: fffff52000140c07 (0xfffff52000140c07)
[  297.630378] ffff880156a5fe10: fffff52000140c08 (0xfffff52000140c08)
[  297.630379] ffff880156a5fe18: 0000000000000000 ...
[  297.630385] ffff880156a5fe20: ffffffffb178d9eb (opt_pre_handler+0x6b/0x80)
[  297.630388] ffff880156a5fe28: dffffc0000000000 (0xdffffc0000000000)
[  297.630391] ffff880156a5fe30: dffffc0000000000 (0xdffffc0000000000)
[  297.630393] ffff880156a5fe38: 0000000000000246 (0x246)
[  297.630396] ffff880156a5fe40: ffffffffffffff10 (0xffffffffffffff10)
[  297.630401] ffff880156a5fe48: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.630404] ffff880156a5fe50: 0000000000000010 (0x10)
[  297.630406] ffff880156a5fe58: 0000000000000246 (0x246)
[  297.630409] ffff880156a5fe60: ffff880156a5fe78 (0xffff880156a5fe78)
[  297.630412] ffff880156a5fe68: 0000000000000018 (0x18)
[  297.630414] ffff880156a5fe70: 0000000000000246 (0x246)
[  297.630417] ffff880156a5fe78: 00000000026aed08 (0x26aed08)
[  297.630419] ffff880156a5fe80: 0000000000000005 (0x5)
[  297.630421] ffff880156a5fe88: 0000000000000003 (0x3)
[  297.630423] ffff880156a5fe90: 0000000000000000 ...
[  297.630425] ffff880156a5fe98: 00000000025d1568 (0x25d1568)
[  297.630428] ffff880156a5fea0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630431] ffff880156a5fea8: ffffffffc0050282 (0xffffffffc0050282)
[  297.630433] ffff880156a5feb0: 00000000025d1568 (0x25d1568)
[  297.630435] ffff880156a5feb8: 0000000000000000 ...
[  297.630437] ffff880156a5fec0: 0000000000000003 (0x3)
[  297.630440] ffff880156a5fec8: 0000000000000005 (0x5)
[  297.630442] ffff880156a5fed0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630445] ffff880156a5fed8: 00000000026aed08 (0x26aed08)
[  297.630448] ffff880156a5fee0: ffff880151b02700 (0xffff880151b02700)
[  297.630450] ffff880156a5fee8: 0000000002675e00 (0x2675e00)
[  297.630453] ffff880156a5fef0: 0000000000000001 (0x1)
[  297.630455] ffff880156a5fef8: 0000000000000002 (0x2)
[  297.630457] ffff880156a5ff00: 0000000000000002 (0x2)
[  297.630460] ffff880156a5ff08: 0000000002675e00 (0x2675e00)
[  297.630462] ffff880156a5ff10: 0000000000000180 (0x180)
[  297.630464] ffff880156a5ff18: 0000000000000000 ...
[  297.630466] ffff880156a5ff20: 000000000272a008 (0x272a008)
[  297.630469] ffff880156a5ff28: ffffffffffffffff (0xffffffffffffffff)
[  297.630473] ffff880156a5ff30: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630475] ffff880156a5ff38: 0000000000000010 (0x10)
[  297.630478] ffff880156a5ff40: 0000000000000293 (0x293)
[  297.630481] ffff880156a5ff48: ffff880156a5ff50 (0xffff880156a5ff50)
[  297.630485] ffff880156a5ff50: ffffffffb1665770 (copy_oldmem_page+0x90/0x90)
[  297.630488] ffff880156a5ff58: 00000000025d1b28 (0x25d1b28)
[  297.630489] ffff880156a5ff60: 0000000000000000 ...
[  297.630492] ffff880156a5ff68: 0000000000000003 (0x3)
[  297.630494] ffff880156a5ff70: 0000000000000005 (0x5)
[  297.630497] ffff880156a5ff78: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630499] ffff880156a5ff80: 00000000026aed08 (0x26aed08)
[  297.630502] ffff880156a5ff88: 0000000000000246 (0x246)
[  297.630504] ffff880156a5ff90: 0000000002675e00 (0x2675e00)
[  297.630506] ffff880156a5ff98: 0000000000000001 (0x1)
[  297.630509] ffff880156a5ffa0: 0000000000000002 (0x2)
[  297.630511] ffff880156a5ffa8: ffffffffffffffda (0xffffffffffffffda)
[  297.630514] ffff880156a5ffb0: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630517] ffff880156a5ffb8: 0000000000000180 (0x180)
[  297.630518] ffff880156a5ffc0: 0000000000000000 ...
[  297.630521] ffff880156a5ffc8: 000000000272a008 (0x272a008)
[  297.630523] ffff880156a5ffd0: 0000000000000002 (0x2)
[  297.630526] ffff880156a5ffd8: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630528] ffff880156a5ffe0: 0000000000000033 (0x33)
[  297.630530] ffff880156a5ffe8: 0000000000000246 (0x246)
[  297.630533] ffff880156a5fff0: 00007ffec7d87db8 (0x7ffec7d87db8)
[  297.630535] ffff880156a5fff8: 000000000000002b (0x2b)

opensnoop(pythong) itself blocks too:

root@test:~# cat /proc/2075/stack
[<ffffffffb79a0a07>] ring_buffer_wait+0x167/0x2e0
[<ffffffffb79a34e7>] wait_on_pipe+0x77/0x80
[<ffffffffb79aa7a1>] tracing_wait_pipe.isra.69+0x51/0xf0
[<ffffffffb79abdf9>] tracing_read_pipe+0x1c9/0x500
[<ffffffffb7ab5e62>] __vfs_read+0xd2/0x370
[<ffffffffb7ab61b7>] vfs_read+0xb7/0x1a0
[<ffffffffb7ab6bd0>] SyS_read+0xa0/0x120
[<ffffffffb8843c37>] entry_SYSCALL_64_fastpath+0x1a/0xa5
[<ffffffffffffffff>] 0xffffffffffffffff

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 21:16 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-1-xiyou.wangcong@gmail.com>

On 09/25/2017 07:13 PM, Cong Wang wrote:
> Instead of calling cls_bpf_get() in a loop to find
> a unused handle, just switch to idr API to allocate
> new handles.
>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Chris Mi <chrism@mellanox.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
[...]
> @@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		}
>   	}
>
> -	if (handle == 0)
> -		prog->handle = cls_bpf_grab_new_handle(tp, head);
> -	else
> +	if (handle == 0) {
> +		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +				    1, 0x7FFFFFFF, GFP_KERNEL);
> +		if (ret)
> +			goto errout;
> +		prog->handle = idr_index;
> +	} else {
> +		if (!oldprog) {
> +			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +					    handle, handle + 1, GFP_KERNEL);
> +			if (ret)
> +				goto errout;
> +		}
>   		prog->handle = handle;
> -	if (prog->handle == 0) {
> -		ret = -EINVAL;
> -		goto errout;
>   	}
>
>   	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
>   	if (ret < 0)
> -		goto errout;
> +		goto errout_idr;
>
>   	ret = cls_bpf_offload(tp, prog, oldprog);
>   	if (ret) {
> +		if (!oldprog)
> +			idr_remove_ext(&head->handle_idr, prog->handle);

Shouldn't we also call idr_remove_ext() when there was an
oldprog, but we didn't care about reusing the same handle,
so it was handle == 0 initially?

There's this condition in the code before above idr allocations,
I think also in other classifiers:

         if (oldprog) {
                 if (handle && oldprog->handle != handle) {
                         ret = -EINVAL;
                         goto errout;
                 }
         }

>   		__cls_bpf_delete_prog(prog);
>   		return ret;
>   	}
> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>
>   	if (oldprog) {
> +		idr_replace_ext(&head->handle_idr, prog, handle);

And here, we should probably use prog->handle for the above
mentioned case as well, no?

Would be great if all this (and e.g. the fact that we use idr itself)
could optionally be hidden behind some handle generator api given
we could reuse that api also for cls_basic and cls_u32. Could also
be followed-up perhaps.

>   		list_replace_rcu(&oldprog->link, &prog->link);
>   		tcf_unbind_filter(tp, &oldprog->res);
>   		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
> @@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   	*arg = prog;
>   	return 0;
>
> +errout_idr:
> +	if (!oldprog)
> +		idr_remove_ext(&head->handle_idr, prog->handle);

(Likewise as the failing cls_bpf_offload().)

>   errout:
>   	tcf_exts_destroy(&prog->exts);
>   	kfree(prog);
>

^ permalink raw reply

* [PATCH net-next] net: ipv6: send NS for DAD when link operationally up
From: Mike Manning @ 2017-09-25 21:01 UTC (permalink / raw)
  To: netdev

The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.

Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/ipv6/addrconf.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78..dffbf3b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -303,10 +303,10 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.disable_policy		= 0,
 };
 
-/* Check if a valid qdisc is available */
-static inline bool addrconf_qdisc_ok(const struct net_device *dev)
+/* Check if link is ready: is it up and is a valid qdisc available */
+static inline bool addrconf_link_ready(const struct net_device *dev)
 {
-	return !qdisc_tx_is_noop(dev);
+	return netif_oper_up(dev) && !qdisc_tx_is_noop(dev);
 }
 
 static void addrconf_del_rs_timer(struct inet6_dev *idev)
@@ -451,7 +451,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 
 	ndev->token = in6addr_any;
 
-	if (netif_running(dev) && addrconf_qdisc_ok(dev))
+	if (netif_running(dev) && addrconf_link_ready(dev))
 		ndev->if_flags |= IF_READY;
 
 	ipv6_mc_init_dev(ndev);
@@ -3393,7 +3393,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 			/* restore routes for permanent addresses */
 			addrconf_permanent_addr(dev);
 
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is not ready yet. */
 				pr_info("ADDRCONF(NETDEV_UP): %s: link is not ready\n",
 					dev->name);
@@ -3408,7 +3408,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				run_pending = 1;
 			}
 		} else if (event == NETDEV_CHANGE) {
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is still not ready. */
 				break;
 			}
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket
From: Thomas Gleixner @ 2017-09-25 20:42 UTC (permalink / raw)
  To: Vallish Vaidyeshwara
  Cc: Eric Dumazet, Eduardo Valentin, David Miller, dwmw2, shuah,
	richardcochran, xiyou.wangcong, netdev, linux-kernel, anchalag,
	dwmw
In-Reply-To: <20170920224816.GA73561@amazon.com>

On Wed, 20 Sep 2017, Vallish Vaidyeshwara wrote:
> On Sat, Sep 16, 2017 at 11:47:56AM +0200, Thomas Gleixner wrote:
> > > So if we need to replace all 'legacy' timers to high resolution timer,
> > > because some application was _relying_ on jiffies being kind of precise,
> > > maybe it is better to revert the change done on legacy timers.
> > 
> > Which would be a major step back in terms of timer performance and system
> > disturbance caused by massive recascading operations.
> > 
> > > Or continue the migration and make them use high res internally.
> > > 
> > > select() and poll() are the standard way to have precise timeouts,
> > > it is silly we have to maintain a timeout handling in the datagram fast
> > > path.
> > 
> > A few years ago we switched select/poll over to use hrtimers because the
> > wheel timers were too inaccurate for some operations, so it feels
> > consequent to switch the timeout in the datagram rcv path over as well. I
> > agree that the whole timeout magic there feels silly, but unfortunately
> > it's a documented property of sockets.
> > 
> 
> Thanks for your comments. This patch has been NACK'ed by David Miller. Is
> there any other approach to solve this problem with out application code
> being recompiled?

We have only three options here:

   1) Do a massive revert of the timer wheel changes and lose all the
      benefits of that rework.

   2) Make that timer list -> hrtimer change in the datagram code

   3) Ignore it

#1 Would be pretty ironic as networking would take the biggest penalty of
   the revert.

#2 Is IMO the proper solution as it cures a user space visible regression,
   though the patch itself could be made way simpler

#3 Shrug

Dave, Eric?

Thanks,

	tglx

^ permalink raw reply

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
From: Paolo Abeni @ 2017-09-25 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform this tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follow:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packet to the reuseport socket with the same id of the ingress
CPU - we have 1 to 1 mapping between the ingress receive queue and the
destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020                3663                21
3               4352                5179                19
4               5318                6194                16
5               6258                7583                21
6               7376                8558                16

[ipv6]
2               2446                3949                61
3               3099                5092                64
4               3698                6611                78
5               4382                7852                79
6               5116                8851                73

Sone notes:

- figures obtained with: 

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottle-neck in
ip6_pol_route(), I see a lot of contention on a dst refcount, so
without early demux the performances do not scale well there.

- For maximum performances BH and user space sink need to run on
difference CPUs - yes we have some more cacheline misses and a little
contention on the receive queue spin lock, but a lot less icache misses
and more CPU cycles available, the overall tput is a lot higher than
binding on the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH v2 4/4] net: nfc: llcp_core: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-4-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:05 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 3/4] net: nfc: hci: llc_shdlc: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-3-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:04 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 2/4] net: nfc: hci: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-2-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:03 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 1/4] net: af_packet: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-1-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:02 +0530

> Use setup_timer function instead of initializing timer with the
> function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-25 20:18 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <3f824b0e-65f9-c69c-5421-2c5f6b349b09@redhat.com>

On 09/22/2017 12:03 AM, Jason Wang wrote:
> 
> 
> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>> patch goes even further which disables rx polling during processing tx.
>>> Please try it to see if it has any difference.
>> Unfortunately, this patch doesn't seem to have made a difference.  I
>> tried runs with both this patch and the previous patch applied, as well
>> as only this patch applied for comparison (numbers from vhost thread of
>> sending VM):
>>
>> 4.12    4.13     patch1   patch2   patch1+2
>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>
>> In each case, the regression in throughput was still present.
> 
> This probably means some other cases of the wakeups were missed. Could
> you please record the callers of __wake_up_sync_key()?
> 

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key
(for both sender and server vhost threads) shows the following stack trace:

     vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
<-sock_def_readable
     vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter

>>
>>> And two questions:
>>> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
>> Verified that the second set of guests are not actually required, I can
>> see the regression with only 2 VMs.
>>
>>> - Can enable batching in the tap of sending VM improve the performance
>>> (ethtool -C $tap rx-frames 64)
>> I tried this, but it did not help (actually seemed to make things a
>> little worse)
>>
> 
>  I still can't see a reason that can lead more wakeups, will take more
> time to look at this issue and keep you posted.
> 
> Thanks
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox