Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v9 0/6] net: vhost: improve performance when enable busyloop
From: David Miller @ 2018-09-09 15:01 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: netdev, virtualization, mst
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: xiangxia.m.yue@gmail.com
Date: Sun,  9 Sep 2018 04:51:21 -0700

> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> This patches improve the guest receive performance.
> On the handle_tx side, we poll the sock receive queue
> at the same time. handle_rx do that in the same way.
> 
> For more performance report, see patch 4, 5, 6

I only see patches 1-4.

^ permalink raw reply

* Re: [PATCH] r8169: inform about CLKRUN protocol issue when behind a CardBus bridge
From: David Miller @ 2018-09-09 15:09 UTC (permalink / raw)
  To: mail; +Cc: nic_swsd, netdev, linux-kernel
In-Reply-To: <fcf6a23a-4721-5912-281e-488eb6a736c6@maciej.szmigiero.name>

From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Date: Thu, 6 Sep 2018 18:10:53 +0200

> It turns out that at least some r8169 CardBus cards don't operate correctly
> when CLKRUN protocol is enabled - the symptoms are recurring timeouts
> during PHY reads / writes and a very high packet drop rate.
> This is true of at least RTL8169sc/8110sc (XID 18000000) chip in
> Sunrich C-160 CardBus NIC.
> 
> Such behavior was observed on two separate laptops, the first one has
> TI PCIxx12 CardBus bridge, while the second one has Ricoh RL5c476II.
> 
> Setting CLKRUN_En bit in CONFIG 3 register via an EEPROM write didn't
> improve things in either case (this is probably why it wasn't set by the
> card manufacturer).
> The only way to fix the issue was to disable the CLKRUN protocol either
> in the CardBus bridge (only possible in the TI one) or in the southbridge.
> 
> Since the problem takes some time to debug let's warn people that have
> the suspect configuration (Conventional PCI r8169 NIC behind a CardBus
> bridge) so they know what they can do if they encounter it.
> 
> Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>

I don't know about this.

Barking at the user in the kernel log about an obscure knob (which btw
doesn't exist for all cardbus bridges without other patches you are
posting elsewhere) is rarely effective.

We should just disable clkrun automatically we know it causes problems.

Sorry, I don't think this is that right approach and therefore I am not
applying this.

^ permalink raw reply

* Re: [PATCH net 02/13] net: sched: cls_u32: mark root hnode explicitly
From: Jamal Hadi Salim @ 2018-09-09 11:39 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-3-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 03/13] net: sched: cls_u32: disallow linking to root hnode
From: Jamal Hadi Salim @ 2018-09-09 11:39 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-4-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> Operation makes no sense.  Nothing will actually break if we do so
> (depth limit in u32_classify() will prevent infinite loops), but
> according to maintainers it's best prohibited outright.
> 
> NOTE: doing so guarantees that u32_destroy() will trigger the call
> of u32_destroy_hnode(); we might want to make that unconditional.
> 
> Test:
> tc qdisc add dev eth0 ingress
> tc filter add dev eth0 parent ffff: protocol ip prio 100 u32 \
> 	link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
> should fail with
> Error: cls_u32: Not linking to root node
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 04/13] net: sched: cls_u32: make sure that divisor is a power of 2
From: Jamal Hadi Salim @ 2018-09-09 11:41 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-5-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
>  > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Test is by hacking user space iproute2 to allow
sending a divisor > 255

Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 05/13] net: sched: cls_u32: get rid of unused argument of u32_destroy_key()
From: Jamal Hadi Salim @ 2018-09-09 11:41 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-6-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 06/13] net: sched: cls_u32: get rid of tc_u_knode ->tp
From: Jamal Hadi Salim @ 2018-09-09 11:43 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-7-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> not used anymore
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 07/13] net: sched: cls_u32: get rid of tc_u_common ->rcu
From: Jamal Hadi Salim @ 2018-09-09 11:45 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-8-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> unused
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 08/13] net: sched: cls_u32: clean tc_u_common hashtable
From: Jamal Hadi Salim @ 2018-09-09 11:47 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-9-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> * calculate key *once*, not for each hash chain element
> * let tc_u_hash() return the pointer to chain head rather than index -
> callers are cleaner that way.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* [PATCH net-next v9 0/6] net: vhost: improve performance when enable busyloop
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: virtualization, netdev, Tonghao Zhang

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patches improve the guest receive performance.
On the handle_tx side, we poll the sock receive queue
at the same time. handle_rx do that in the same way.

For more performance report, see patch 4, 5, 6

Tonghao Zhang (6):
  net: vhost: lock the vqs one by one
  net: vhost: replace magic number of lock annotation
  net: vhost: factor out busy polling logic to vhost_net_busy_poll()
  net: vhost: add rx busy polling in tx path
  net: vhost: disable rx wakeup during tx busypoll
  net: vhost: make busyloop_intr more accurate

 drivers/vhost/net.c   | 163 +++++++++++++++++++++++++++++++-------------------
 drivers/vhost/vhost.c |  24 +++-----
 2 files changed, 108 insertions(+), 79 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net-next v9 1/6] net: vhost: lock the vqs one by one
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: virtualization, netdev, Tonghao Zhang
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch changes the way that lock all vqs
at the same, to lock them one by one. It will
be used for next patch to avoid the deadlock.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..a1c06e7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 {
 	int i;
 
-	for (i = 0; i < d->nvqs; ++i)
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
 		__vhost_vq_meta_reset(d->vqs[i]);
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -890,20 +893,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_unlock(&d->vqs[i]->mutex);
-}
-
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -953,7 +942,10 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
 		if (msg->iova <= vq_msg->iova &&
 		    msg->iova + msg->size - 1 > vq_msg->iova &&
 		    vq_msg->type == VHOST_IOTLB_MISS) {
+			mutex_lock(&node->vq->mutex);
 			vhost_poll_queue(&node->vq->poll);
+			mutex_unlock(&node->vq->mutex);
+
 			list_del(&node->node);
 			kfree(node);
 		}
@@ -985,7 +977,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	int ret = 0;
 
 	mutex_lock(&dev->mutex);
-	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		if (!dev->iotlb) {
@@ -1019,7 +1010,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		break;
 	}
 
-	vhost_dev_unlock_vqs(dev);
 	mutex_unlock(&dev->mutex);
 
 	return ret;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net 09/13] net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode
From: Jamal Hadi Salim @ 2018-09-09 11:52 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-10-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> the only thing we used ht for was ht->tp_c and callers can get that
> without going through ->tp_c at all; start with lifting that into
> the callers, next commits will massage those, eventually removing
> ->tp_c altogether.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* [PATCH net-next v9 2/6] net: vhost: replace magic number of lock annotation
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: virtualization, netdev, Tonghao Zhang
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 367d802..32c1b52 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -712,7 +712,7 @@ static void handle_tx(struct vhost_net *net)
 	struct vhost_virtqueue *vq = &nvq->vq;
 	struct socket *sock;
 
-	mutex_lock(&vq->mutex);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
@@ -777,7 +777,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 		/* Flush batched heads first */
 		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&tvq->mutex, 1);
+		mutex_lock_nested(&tvq->mutex, VHOST_NET_VQ_TX);
 		vhost_disable_notify(&net->dev, tvq);
 
 		preempt_disable();
@@ -919,7 +919,7 @@ static void handle_rx(struct vhost_net *net)
 	__virtio16 num_buffers;
 	int recv_pkts = 0;
 
-	mutex_lock_nested(&vq->mutex, 0);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v9 3/6] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: virtualization, netdev, Tonghao Zhang
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Factor out generic busy polling logic and will be
used for in tx path in the next patch. And with the patch,
qemu can set differently the busyloop_timeout for rx queue.

To avoid duplicate codes, introduce the helper functions:
* sock_has_rx_data(changed from sk_has_rx_data)
* vhost_net_busy_poll_try_queue

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/vhost/net.c | 109 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 69 insertions(+), 40 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 32c1b52..3f68265 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -440,6 +440,73 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
 	nvq->done_idx = 0;
 }
 
+static int sock_has_rx_data(struct socket *sock)
+{
+	if (unlikely(!sock))
+		return 0;
+
+	if (sock->ops->peek_len)
+		return sock->ops->peek_len(sock);
+
+	return skb_queue_empty(&sock->sk->sk_receive_queue);
+}
+
+static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	if (!vhost_vq_avail_empty(&net->dev, vq)) {
+		vhost_poll_queue(&vq->poll);
+	} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+		vhost_disable_notify(&net->dev, vq);
+		vhost_poll_queue(&vq->poll);
+	}
+}
+
+static void vhost_net_busy_poll(struct vhost_net *net,
+				struct vhost_virtqueue *rvq,
+				struct vhost_virtqueue *tvq,
+				bool *busyloop_intr,
+				bool poll_rx)
+{
+	unsigned long busyloop_timeout;
+	unsigned long endtime;
+	struct socket *sock;
+	struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
+
+	mutex_lock_nested(&vq->mutex, poll_rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
+	vhost_disable_notify(&net->dev, vq);
+	sock = rvq->private_data;
+
+	busyloop_timeout = poll_rx ? rvq->busyloop_timeout:
+				     tvq->busyloop_timeout;
+
+	preempt_disable();
+	endtime = busy_clock() + busyloop_timeout;
+
+	while (vhost_can_busy_poll(endtime)) {
+		if (vhost_has_work(&net->dev)) {
+			*busyloop_intr = true;
+			break;
+		}
+
+		if ((sock_has_rx_data(sock) &&
+		     !vhost_vq_avail_empty(&net->dev, rvq)) ||
+		    !vhost_vq_avail_empty(&net->dev, tvq))
+			break;
+
+		cpu_relax();
+	}
+
+	preempt_enable();
+
+	if (poll_rx || sock_has_rx_data(sock))
+		vhost_net_busy_poll_try_queue(net, vq);
+	else if (!poll_rx) /* On tx here, sock has no rx data. */
+		vhost_enable_notify(&net->dev, rvq);
+
+	mutex_unlock(&vq->mutex);
+}
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_net_virtqueue *nvq,
 				    unsigned int *out_num, unsigned int *in_num,
@@ -753,16 +820,6 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	return len;
 }
 
-static int sk_has_rx_data(struct sock *sk)
-{
-	struct socket *sock = sk->sk_socket;
-
-	if (sock->ops->peek_len)
-		return sock->ops->peek_len(sock);
-
-	return skb_queue_empty(&sk->sk_receive_queue);
-}
-
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 				      bool *busyloop_intr)
 {
@@ -770,41 +827,13 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 	struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *rvq = &rnvq->vq;
 	struct vhost_virtqueue *tvq = &tnvq->vq;
-	unsigned long uninitialized_var(endtime);
 	int len = peek_head_len(rnvq, sk);
 
-	if (!len && tvq->busyloop_timeout) {
+	if (!len && rvq->busyloop_timeout) {
 		/* Flush batched heads first */
 		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&tvq->mutex, VHOST_NET_VQ_TX);
-		vhost_disable_notify(&net->dev, tvq);
-
-		preempt_disable();
-		endtime = busy_clock() + tvq->busyloop_timeout;
-
-		while (vhost_can_busy_poll(endtime)) {
-			if (vhost_has_work(&net->dev)) {
-				*busyloop_intr = true;
-				break;
-			}
-			if ((sk_has_rx_data(sk) &&
-			     !vhost_vq_avail_empty(&net->dev, rvq)) ||
-			    !vhost_vq_avail_empty(&net->dev, tvq))
-				break;
-			cpu_relax();
-		}
-
-		preempt_enable();
-
-		if (!vhost_vq_avail_empty(&net->dev, tvq)) {
-			vhost_poll_queue(&tvq->poll);
-		} else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
-			vhost_disable_notify(&net->dev, tvq);
-			vhost_poll_queue(&tvq->poll);
-		}
-
-		mutex_unlock(&tvq->mutex);
+		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
 
 		len = peek_head_len(rnvq, sk);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v9 4/6] net: vhost: add rx busy polling in tx path
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: virtualization, netdev, Tonghao Zhang
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch improves the guest receive performance.
On the handle_tx side, we poll the sock receive queue at the
same time. handle_rx do that in the same way.

We set the poll-us=100us and use the netperf to test throughput
and mean latency. When running the tests, the vhost-net kthread
of that VM, is alway 100% CPU. The commands are shown as below.

Rx performance is greatly improved by this patch. There is not
notable performance change on tx with this series though. This
patch is useful for bi-directional traffic.

netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"

Topology:
[Host] ->linux bridge -> tap vhost-net ->[Guest]

TCP_STREAM:
* Without the patch:  19842.95 Mbps, 6.50 us mean latency
* With the patch:     37598.20 Mbps, 3.43 us mean latency

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/vhost/net.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3f68265..d372c14 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -508,31 +508,24 @@ static void vhost_net_busy_poll(struct vhost_net *net,
 }
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
-				    struct vhost_net_virtqueue *nvq,
+				    struct vhost_net_virtqueue *tnvq,
 				    unsigned int *out_num, unsigned int *in_num,
 				    bool *busyloop_intr)
 {
-	struct vhost_virtqueue *vq = &nvq->vq;
-	unsigned long uninitialized_var(endtime);
-	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+	struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *rvq = &rnvq->vq;
+	struct vhost_virtqueue *tvq = &tnvq->vq;
+
+	int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
 				  out_num, in_num, NULL, NULL);
 
-	if (r == vq->num && vq->busyloop_timeout) {
-		if (!vhost_sock_zcopy(vq->private_data))
-			vhost_net_signal_used(nvq);
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(endtime)) {
-			if (vhost_has_work(vq->dev)) {
-				*busyloop_intr = true;
-				break;
-			}
-			if (!vhost_vq_avail_empty(vq->dev, vq))
-				break;
-			cpu_relax();
-		}
-		preempt_enable();
-		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+	if (r == tvq->num && tvq->busyloop_timeout) {
+		if (!vhost_sock_zcopy(tvq->private_data))
+			vhost_net_signal_used(tnvq);
+
+		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
+
+		r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net 10/13] net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data
From: Jamal Hadi Salim @ 2018-09-09 12:48 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-11-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 


Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 11/13] net: sched: cls_u32: get rid of hnode ->tp_c and tp_c argument of u32_set_parms()
From: Jamal Hadi Salim @ 2018-09-09 12:49 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-12-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 12/13] net: sched: cls_u32: keep track of knodes count in tc_u_common
From: Jamal Hadi Salim @ 2018-09-09 12:50 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-13-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> allows to simplify u32_delete() considerably
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 13/13] net: sched: cls_u32: simplify the hell out u32_delete() emptiness check
From: Jamal Hadi Salim @ 2018-09-09 12:51 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-14-viro@ZenIV.linux.org.uk>

On 2018-09-08 9:31 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> Now that we have the knode count, we can instantly check if
> any hnodes are non-empty.  And that kills the check for extra
> references to root hnode - those could happen only if there was
> a knode to carry such a link.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
From: Jamal Hadi Salim @ 2018-09-09 12:58 UTC (permalink / raw)
  To: Al Viro, netdev; +Cc: Cong Wang, Jiri Pirko
In-Reply-To: <20180909013132.3222-1-viro@ZenIV.linux.org.uk>


Since you have the momentum here: i noticed something
unusual while i was trying to craft a test that would
vet some of your changes. This has nothing to do with
your changes, same happens on my stock debian laptop
with kernel:
4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)

Looking at git - possibly introduced around the time u32
lockless was being introduced and maybe even earlier
than that. Unfortunately i dont have time to dig
further.

To reproduce what i am referring to, here's a setup:

$tc filter add dev eth0 parent ffff: protocol ip prio 102 u32 \
classid 1:2 match ip src 192.168.8.0/8
$tc filter replace dev eth0 parent ffff: protocol ip prio 102 \
handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24

u32_change() code path should have allowed changing of the
keynode.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] tcp: show number of network segments in some SNMP counters
From: Eric Dumazet @ 2018-09-09 18:27 UTC (permalink / raw)
  To: Yafang Shao; +Cc: David Miller, netdev, LKML
In-Reply-To: <1536462862-11767-1-git-send-email-laoar.shao@gmail.com>

On Sat, Sep 8, 2018 at 8:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> It is better to show the number of network segments in bellow SNMP
> counters, because that could be more useful for the user.
> For example, the user could easily figure out how mant packets are
> dropped and how many packets are queued in the out-of-oder queue.
>
> - LINUX_MIB_TCPRCVQDROP
> - LINUX_MIB_TCPZEROWINDOWDROP
> - LINUX_MIB_TCPBACKLOGDROP
> - LINUX_MIB_TCPMINTTLDROP
> - LINUX_MIB_TCPOFODROP
> - LINUX_MIB_TCPOFOQUEUE
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  net/ipv4/tcp_input.c | 18 ++++++++++++------
>  net/ipv4/tcp_ipv4.c  |  9 ++++++---
>  net/ipv6/tcp_ipv6.c  |  6 ++++--
>  3 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 62508a2..c2ce334 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4496,7 +4496,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
>         tcp_ecn_check_ce(sk, skb);
>
>         if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
> -               NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
> +               NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP,
> +                             max_t(u16, 1, skb_shinfo(skb)->gso_segs));
>

I am nacking this patch. These counters are counting events really.

^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] tcp: fix the error count of tcpInSegs
From: Eric Dumazet @ 2018-09-09 18:32 UTC (permalink / raw)
  To: Yafang Shao; +Cc: David Miller, netdev, LKML
In-Reply-To: <1536462862-11767-2-git-send-email-laoar.shao@gmail.com>

On Sat, Sep 8, 2018 at 8:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> In RFC1213, the tcpInSegs is the total number of segments received.
> While currently it is the total number of SKBs received.
> The number of SKBs may be not equal with the numer of segments because of
> GRO.
> So fix this error count.
>

We have discussed this in the past and the consensus was it was too
late to change this.

IP counters have the same issue, so after your patch, we would have
quite a difference between transport and network layers.

Adding all these max_t(u16, 1, skb_shinfo(skb)->gso_segs)) everywhere add a cost

^ permalink raw reply

* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Eric Dumazet @ 2018-09-09 18:38 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Herbert Xu, David Miller, Neil Horman, Marcelo Ricardo Leitner,
	Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	linux-crypto, LKML, linux-sctp, netdev, linux-decnet-user,
	kernel-team, Yuchung Cheng, Neal Cardwell
In-Reply-To: <CAOesGMi31UA2d-Bj2jo53Wz_YV424-rD3qk9rS5_-Yng0VC=0w@mail.gmail.com>

On Sat, Sep 8, 2018 at 10:02 AM Olof Johansson <olof@lixom.net> wrote:
>
> Hi,
>
> On Fri, Sep 7, 2018 at 12:21 AM, Eric Dumazet <edumazet@google.com> wrote:
> > On Fri, Sep 7, 2018 at 12:03 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> >> Problem is : we have platforms with more than 100 cpus, and
> >> sk_memory_allocated() cost will be too expensive,
> >> especially if the host is under memory pressure, since all cpus will
> >> touch their private counter.
> >>
> >> per cpu variables do not really scale, they were ok 10 years ago when
> >> no more than 16 cpus were the norm.
> >>
> >> I would prefer change TCP to not aggressively call
> >> __sk_mem_reduce_allocated() from tcp_write_timer()
> >>
> >> Ideally only tcp_retransmit_timer() should attempt to reduce forward
> >> allocations, after recurring timeout.
> >>
> >> Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
> >> sk_forward_alloc overflows")
> >> we have better control over sockets having huge forward allocations.
> >>
> >> Something like :
> >
> > Or something less risky :
>
> I gave both of these patches a run, and neither do as well on the
> system that has slower atomics. :(
>
> The percpu version:
>
>      8.05%  workload         [kernel.vmlinux]
>     [k] __do_softirq
>      7.04%  swapper          [kernel.vmlinux]
>     [k] cpuidle_enter_state
>      5.54%  workload         [kernel.vmlinux]
>     [k] _raw_spin_unlock_irqrestore
>      1.66%  swapper          [kernel.vmlinux]
>     [k] __do_softirq
>      1.55%  workload         [kernel.vmlinux]
>     [k] finish_task_switch
>      1.24%  swapper          [kernel.vmlinux]
>     [k] finish_task_switch
>      1.07%  workload         [kernel.vmlinux]
>     [k] net_rx_action
>
> The first patch from you still has significant amount of time spent in
> the atomics paths (non-inlined versions used):
>
>      7.87%  workload         [kernel.vmlinux]
> [k] __ll_sc_atomic64_sub


The second patch I gave should not enter this path at all, please try it.

>      7.48%  workload         [kernel.vmlinux]
> [k] __do_softirq
>      5.05%  workload         [kernel.vmlinux]
> [k] _raw_spin_unlock_irqrestore
>      2.42%  workload         [kernel.vmlinux]
> [k] __ll_sc_atomic64_add_return
>      1.49%  swapper          [kernel.vmlinux]
> [k] cpuidle_enter_state
>      1.31%  workload         [kernel.vmlinux]
> [k] finish_task_switch
>      1.09%  workload         [kernel.vmlinux]
> [k] tcp_sendmsg_locked
>      1.08%  workload         [kernel.vmlinux]
> [k] __arch_copy_from_user
>      1.02%  workload         [kernel.vmlinux]
> [k] net_rx_action
>
> I think a lot of the overhead from percpu approach can be alleviated
> if we can use percpu_counter_read() instead of _sum() (i.e. no need to
> iterate through the local per-cpu recent delta). I don't know the TCP
> stack well enough to tell where it's OK to use a bit of slack in the
> numbers though -- by default count will at most be off by 32*online
> cpus. Might not be a significant number in reality.

^ permalink raw reply

* [PATCH] linux-next :cfg80211: Fix warnings while make xmldocs
From: Masanari Iida @ 2018-09-09 13:55 UTC (permalink / raw)
  To: davem, linux-kernel, sgruszka, johannes.berg, netdev, corbet,
	linux-doc
  Cc: Masanari Iida

During 4.19-rc1 merger period, 38cb87ee47fb8 was merged.
In the patch, one of an argument "*ptr" was removed from
the function. But in the comment of the function "@ptr"
still exist. Which causes 109 lines of following warning
during "make xmldocs"

./include/net/cfg80211.h:4869: warning: Excess function
 parameter 'ptr' description in 'reg_query_regdb_wmm'

After apply this patch and make xmldocs does not show
this message any more.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
 include/net/cfg80211.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 8ebabc9873d1..4de121e24ce5 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -4852,8 +4852,6 @@ const char *reg_initiator_name(enum nl80211_reg_initiator initiator);
  *
  * @alpha2: the ISO/IEC 3166 alpha2 wmm rule to be queried.
  * @freq: the freqency(in MHz) to be queried.
- * @ptr: pointer where the regdb wmm data is to be stored (or %NULL if
- *	irrelevant). This can be used later for deduplication.
  * @rule: pointer to store the wmm rule from the regulatory db.
  *
  * Self-managed wireless drivers can use this function to  query
-- 
2.19.0.rc2

^ permalink raw reply related

* Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
From: Al Viro @ 2018-09-09 14:15 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: netdev, Cong Wang, Jiri Pirko
In-Reply-To: <5f835727-3f86-1532-889f-d11ae17e16d4@mojatatu.com>

On Sun, Sep 09, 2018 at 08:58:50AM -0400, Jamal Hadi Salim wrote:
> 
> Since you have the momentum here: i noticed something
> unusual while i was trying to craft a test that would
> vet some of your changes. This has nothing to do with
> your changes, same happens on my stock debian laptop
> with kernel:
> 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)
> 
> Looking at git - possibly introduced around the time u32
> lockless was being introduced and maybe even earlier
> than that.

It's always been that way, actually - before that point the old
knode simply got reused, which excluded any chance of changing
n->sel.

> Unfortunately i dont have time to dig
> further.
> 
> To reproduce what i am referring to, here's a setup:
> 
> $tc filter add dev eth0 parent ffff: protocol ip prio 102 u32 \
> classid 1:2 match ip src 192.168.8.0/8
> $tc filter replace dev eth0 parent ffff: protocol ip prio 102 \
> handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24
> 
> u32_change() code path should have allowed changing of the
> keynode.

Umm...  Interesting - TCA_U32_SEL is not the only thing that
gets ignored there; TCA_U32_MARK gets the same treatment.
And then there's a lovely question what to do with n->pf -
it's an array of n->sel.nkeys counters, and apparently we
want (at least in common cases) to avoid resetting those.

*If* we declare that ->nkeys mismatch means failure, it's
all relatively easy to implement.  Alternatively, we could
declare that selector change means resetting the stats.
Preferences?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox