Netdev List


* Re: [PATCH v2] team: Add vlan tx offload to hw_enc_features
From: David Miller @ 2019-08-09  5:42 UTC (permalink / raw)
  To: yuehaibing
  Cc: j.vosburgh, vfalico, andy, jiri, jay.vosburgh, linux-kernel,
	netdev
In-Reply-To: <20190808062247.38352-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Thu, 8 Aug 2019 14:22:47 +0800

> We should also enable team's vlan tx offload in hw_enc_features,
> pass the vlan packets to the slave devices with the vlan tci, and let
> the slaves handle the vlan tunneling offload implementation.
> 
> Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
> v2: fix commit log typo

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH v2] net: tundra: tsi108: use spin_lock_irqsave instead of spin_lock_irq in IRQ context
From: David Miller @ 2019-08-09  5:43 UTC (permalink / raw)
  To: huangfq.daxian; +Cc: netdev, linux-kernel
In-Reply-To: <20190809053539.8341-1-huangfq.daxian@gmail.com>

From: Fuqian Huang <huangfq.daxian@gmail.com>
Date: Fri,  9 Aug 2019 13:35:39 +0800

> spin_unlock_irq() unconditionally enables interrupts.
> tsi108_stat_carry() is called from the interrupt handler tsi108_irq(),
> so interrupts end up enabled inside the interrupt handler.
> Use spin_lock_irqsave()/spin_unlock_irqrestore() instead of
> spin_(un)lock_irq() in IRQ context to avoid this.
> 
> Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
> ---
> Changes in v2:
>   - Preserve reverse christmas tree ordering of local variables.

Applied, thanks.
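
The difference between the two lock variants can be illustrated with a
small userspace model. This is only a sketch under stated assumptions:
irqs_enabled and all model_* functions are hypothetical stand-ins, not
kernel APIs, and the real spin_lock_irqsave() is a macro that stores the
flags in a caller-supplied variable rather than returning them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU's interrupt-enable state. */
static bool irqs_enabled = true;

/* Models spin_lock_irq(): disables interrupts unconditionally. */
static void model_lock_irq(void)
{
	irqs_enabled = false;
}

/* Models spin_unlock_irq(): enables interrupts unconditionally. */
static void model_unlock_irq(void)
{
	irqs_enabled = true;
}

/* Models spin_lock_irqsave(): remembers the old state, then disables. */
static bool model_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Models spin_unlock_irqrestore(): puts back whatever was saved. */
static void model_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/* Critical section using the _irq variants; returns the state afterwards. */
static bool state_after_irq_variant(bool enabled_on_entry)
{
	irqs_enabled = enabled_on_entry;
	model_lock_irq();
	model_unlock_irq();
	return irqs_enabled;
}

/* Critical section using the _irqsave variants; returns the state afterwards. */
static bool state_after_irqsave_variant(bool enabled_on_entry)
{
	bool flags;

	irqs_enabled = enabled_on_entry;
	flags = model_lock_irqsave();
	assert(!irqs_enabled);	/* always off inside the critical section */
	model_unlock_irqrestore(flags);
	return irqs_enabled;
}
```

Entering with interrupts disabled, as in a handler, the _irq pair leaves
them enabled on exit, while the _irqsave pair returns to the state it
found.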

^ permalink raw reply

* [PATCH V5 1/9] vhost: don't set uaddr for invalid address
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We should not set up uaddr for an invalid address; otherwise we may
try to pin or prefetch the mapping of the wrong pages.

Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..488380a581dc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2082,7 +2082,8 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	}
 
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-	vhost_setup_vq_uaddr(vq);
+	if (r == 0)
+		vhost_setup_vq_uaddr(vq);
 
 	if (d->mm)
 		mmu_notifier_register(&d->mmu_notifier, d->mm);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 2/9] vhost: validate MMU notifier registration
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

The return value of mmu_notifier_register() is not checked in
vhost_vring_set_num_addr(). This leaves the mm and the MMU notifier
out of sync and can result in a double free. To solve this, introduce
a boolean flag that tracks whether the MMU notifier is registered, and
only unregister when it is set.

Reported-and-tested-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 19 +++++++++++++++----
 drivers/vhost/vhost.h |  1 +
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 488380a581dc..17f6abea192e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -629,6 +629,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
+	dev->has_notifier = false;
 	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
@@ -730,6 +731,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_mmu_notifier;
 #endif
+	dev->has_notifier = true;
 
 	return 0;
 
@@ -959,7 +961,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	}
 	if (dev->mm) {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
+		if (dev->has_notifier) {
+			mmu_notifier_unregister(&dev->mmu_notifier,
+						dev->mm);
+			dev->has_notifier = false;
+		}
 #endif
 		mmput(dev->mm);
 	}
@@ -2064,8 +2070,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	/* Unregister MMU notifer to allow invalidation callback
 	 * can access vq->uaddrs[] without holding a lock.
 	 */
-	if (d->mm)
+	if (d->has_notifier) {
 		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+		d->has_notifier = false;
+	}
 
 	vhost_uninit_vq_maps(vq);
 #endif
@@ -2085,8 +2093,11 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	if (r == 0)
 		vhost_setup_vq_uaddr(vq);
 
-	if (d->mm)
-		mmu_notifier_register(&d->mmu_notifier, d->mm);
+	if (d->mm) {
+		r = mmu_notifier_register(&d->mmu_notifier, d->mm);
+		if (!r)
+			d->has_notifier = true;
+	}
 #endif
 
 	mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 42a8c2a13ab1..a9a2a93857d2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -214,6 +214,7 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
+	bool has_notifier;
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
-- 
2.18.1
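
The flag-guarded register/unregister pairing can be sketched in
userspace as follows. This is a minimal model, not the vhost code:
struct model_dev, live_registrations and the model_* functions are
hypothetical, and the fail parameter exists only to simulate an error
from mmu_notifier_register().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state touched by the patch. */
struct model_dev {
	bool has_notifier;	/* mirrors the flag added to struct vhost_dev */
	int live_registrations;	/* what the "mm side" believes is registered */
};

/* Stand-in for mmu_notifier_register(); fail != 0 simulates an error. */
static int model_register(struct model_dev *d, int fail)
{
	if (fail)
		return -1;
	d->live_registrations++;
	return 0;
}

/* Register and record the outcome in the flag. */
static int model_reregister(struct model_dev *d, int fail)
{
	int r = model_register(d, fail);

	if (!r)
		d->has_notifier = true;
	return r;
}

/* Unregister only when the flag says a registration is live. */
static void model_unregister(struct model_dev *d)
{
	if (!d->has_notifier)
		return;
	assert(d->live_registrations > 0);	/* would be a double unregister */
	d->live_registrations--;
	d->has_notifier = false;
}
```

With the flag in place, a failed registration followed by any number of
unregister calls never touches a notifier that is not registered.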


^ permalink raw reply related

* [PATCH V5 3/9] vhost: fix vhost map leak
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We don't free the map in vhost_map_unprefetch(), which means it can
be leaked. Fix this by freeing the map.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 17f6abea192e..2a3154976277 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -302,9 +302,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 static void vhost_map_unprefetch(struct vhost_map *map)
 {
 	kfree(map->pages);
-	map->pages = NULL;
-	map->npages = 0;
-	map->addr = NULL;
+	kfree(map);
 }
 
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
-- 
2.18.1
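
The ownership rule behind this fix, that the teardown helper frees both
the page array and the map object itself, can be checked with a small
userspace sketch. All names here (model_alloc, model_free, live_allocs,
struct model_map) are hypothetical and only count allocations; they do
not reflect the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocation counter used to detect leaks in the model. */
static int live_allocs;

static void *model_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		live_allocs++;
	return p;
}

static void model_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Simplified stand-in for struct vhost_map. */
struct model_map {
	int npages;
	void **pages;
};

/* Allocates the map object and its page array. */
static struct model_map *model_map_prefetch(int npages)
{
	struct model_map *map;

	assert(npages > 0);
	map = model_alloc(sizeof(*map));
	if (!map)
		return NULL;
	map->pages = model_alloc((size_t)npages * sizeof(*map->pages));
	if (!map->pages) {
		model_free(map);
		return NULL;
	}
	map->npages = npages;
	return map;
}

/* Frees the page array and then the map itself, as the patch does. */
static void model_map_unprefetch(struct model_map *map)
{
	model_free(map->pages);
	model_free(map);
}
```

Dropping the second model_free() call reproduces the leak: one
allocation stays live after every teardown.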


^ permalink raw reply related

* [PATCH V5 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

vhost_set_vring_num_addr() could be called between
invalidate_range_start() and invalidate_range_end(). If we don't reset
invalidate_count after unregistering the MMU notifier,
invalidate_count will go out of sync (e.g. never reach zero), which in
effect disables the fast accessor path. Fix this by resetting the
count to zero.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a3154976277..2a7217c33668 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 		d->has_notifier = false;
 	}
 
+	/* reset invalidate_count in case we are in the middle of
+	 * invalidate_start() and invalidate_end().
+	 */
+	vq->invalidate_count = 0;
 	vhost_uninit_vq_maps(vq);
 #endif
 
-- 
2.18.1
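
The way the counter goes out of sync can be modeled in a few lines. The
sketch below assumes, as the commit message implies, that the matching
end callback is not delivered once the notifier has been unregistered;
struct model_vq and the model_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-virtqueue invalidation state. */
struct model_vq {
	int invalidate_count;	/* mirrors vq->invalidate_count */
	bool notifier_live;	/* callbacks are delivered only while true */
};

/* invalidate_range_start() side: bump the count if we are registered. */
static void model_range_start(struct model_vq *vq)
{
	if (vq->notifier_live)
		vq->invalidate_count++;
}

/* invalidate_range_end() side: dropped if we are no longer registered. */
static void model_range_end(struct model_vq *vq)
{
	if (vq->notifier_live) {
		assert(vq->invalidate_count > 0);
		vq->invalidate_count--;
	}
}

/* ioctl side: unregister, optionally resetting the count as the patch does. */
static void model_unregister(struct model_vq *vq, bool reset_count)
{
	vq->notifier_live = false;
	if (reset_count)
		vq->invalidate_count = 0;
}

static void model_register(struct model_vq *vq)
{
	vq->notifier_live = true;
}

/* The fast accessor path is usable only with no invalidation in flight. */
static bool model_fast_path_ok(const struct model_vq *vq)
{
	return vq->invalidate_count == 0;
}
```

Running start, unregister, end, register without the reset leaves the
count stuck at one, so the fast path stays disabled; with the reset it
becomes usable again.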


^ permalink raw reply related

* [PATCH V5 5/9] vhost: mark dirty pages during map uninit
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We don't mark pages dirty if the map is torn down outside the MMU
notifier. This leads to untracked dirty pages. Fix this by marking
pages dirty during map uninit.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a7217c33668..c12cdadb0855 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -305,6 +305,18 @@ static void vhost_map_unprefetch(struct vhost_map *map)
 	kfree(map);
 }
 
+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+				struct vhost_map *map, int index)
+{
+	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+	int i;
+
+	if (uaddr->write) {
+		for (i = 0; i < map->npages; i++)
+			set_page_dirty(map->pages[i]);
+	}
+}
+
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 {
 	struct vhost_map *map[VHOST_NUM_ADDRS];
@@ -314,8 +326,10 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
 		map[i] = rcu_dereference_protected(vq->maps[i],
 				  lockdep_is_held(&vq->mmu_lock));
-		if (map[i])
+		if (map[i]) {
+			vhost_set_map_dirty(vq, map[i], i);
 			rcu_assign_pointer(vq->maps[i], NULL);
+		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
@@ -353,7 +367,6 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
-	int i;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
 		return;
@@ -364,10 +377,7 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	map = rcu_dereference_protected(vq->maps[index],
 					lockdep_is_held(&vq->mmu_lock));
 	if (map) {
-		if (uaddr->write) {
-			for (i = 0; i < map->npages; i++)
-				set_page_dirty(map->pages[i]);
-		}
+		vhost_set_map_dirty(vq, map, index);
 		rcu_assign_pointer(vq->maps[index], NULL);
 	}
 	spin_unlock(&vq->mmu_lock);
-- 
2.18.1
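
The shared helper introduced here marks pages dirty only when the
mapping is writable. A userspace sketch of that rule, with hypothetical
types (struct model_page, struct model_map) and a plain boolean in
place of set_page_dirty():

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_PAGES 8

/* Hypothetical page with a dirty bit, standing in for struct page. */
struct model_page {
	bool dirty;
};

/* Simplified stand-in for struct vhost_map. */
struct model_map {
	int npages;
	struct model_page *pages[MODEL_MAX_PAGES];
};

/* Marks every page dirty, but only for mappings that may be written. */
static void model_set_map_dirty(struct model_map *map, bool writable)
{
	int i;

	assert(map->npages <= MODEL_MAX_PAGES);
	if (!writable)
		return;
	for (i = 0; i < map->npages; i++)
		map->pages[i]->dirty = true;
}
```

Both teardown paths (the notifier callback and map uninit) can then call
the same helper, which is the point of the refactoring in this patch.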


^ permalink raw reply related

* [PATCH V5 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

There's no need for RCU synchronization in vhost_uninit_vq_maps()
since we've already serialized with readers (memory accessors). This
also avoids a possible userspace DoS through ioctl() caused by the
potentially high latency of synchronize_rcu().

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c12cdadb0855..cfc11f9ed9c9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -333,7 +333,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	synchronize_rcu();
+	/* No need for synchronize_rcu() or kfree_rcu() since we are
+	 * serialized with memory accessors (e.g vq mutex held).
+	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
 		if (map[i])
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We used to use RCU to synchronize the MMU notifier with the worker.
This leads to calling synchronize_rcu() in invalidate_range_start().
But on a busy system, many factors may slow down synchronize_rcu(),
which makes it unsuitable to be called in an MMU notifier. This patch
switches to a simple spinlock for the synchronization.

The benchmark was done with testpmd + vhost_net + XDP_DROP on tap.
Compared to the copy_{to|from}_user() path, a 1.5% PPS improvement was
measured on Sandy Bridge (without SMAP support), and a 14% PPS
improvement on Broadwell (with SMAP supported and enabled).

This means we are not as fast as 7f466032dc9e was, because of the
spinlock overhead in the datapath. This needs to be addressed in the
future.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 115 ++++++++++++++++++++++--------------------
 drivers/vhost/vhost.h |   5 +-
 2 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cfc11f9ed9c9..29e8abe694f7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 
 	spin_lock(&vq->mmu_lock);
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		map[i] = rcu_dereference_protected(vq->maps[i],
-				  lockdep_is_held(&vq->mmu_lock));
+		map[i] = vq->maps[i];
 		if (map[i]) {
 			vhost_set_map_dirty(vq, map[i], i);
-			rcu_assign_pointer(vq->maps[i], NULL);
+			vq->maps[i] = NULL;
 		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	/* No need for synchronize_rcu() or kfree_rcu() since we are
-	 * serialized with memory accessors (e.g vq mutex held).
+	/* No need for synchronization since we are serialized with
+	 * memory accessors (e.g vq mutex held).
 	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
@@ -362,6 +361,16 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
 	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
 }
 
+static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
+{
+	spin_lock(&vq->mmu_lock);
+}
+
+static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
+{
+	spin_unlock(&vq->mmu_lock);
+}
+
 static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 				      int index,
 				      unsigned long start,
@@ -376,16 +385,14 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
 
-	map = rcu_dereference_protected(vq->maps[index],
-					lockdep_is_held(&vq->mmu_lock));
+	map = vq->maps[index];
 	if (map) {
+		vq->maps[index] = NULL;
 		vhost_set_map_dirty(vq, map, index);
-		rcu_assign_pointer(vq->maps[index], NULL);
 	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
-		synchronize_rcu();
 		vhost_map_unprefetch(map);
 	}
 }
@@ -457,7 +464,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			RCU_INIT_POINTER(vq->maps[j], NULL);
+			vq->maps[j] = NULL;
 	}
 }
 #endif
@@ -921,7 +928,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 	map->npages = npages;
 	map->pages = pages;
 
-	rcu_assign_pointer(vq->maps[index], map);
+	vq->maps[index] = map;
 	/* No need for a synchronize_rcu(). This function should be
 	 * called by dev->worker so we are serialized with all
 	 * readers.
@@ -1216,18 +1223,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*((__virtio16 *)&used->ring[vq->num]) =
 				cpu_to_vhost16(vq, vq->avail_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1245,18 +1252,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 	size_t size;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			size = count * sizeof(*head);
 			memcpy(used->ring + idx, head, size);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1272,17 +1279,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->flags = cpu_to_vhost16(vq, vq->used_flags);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1298,17 +1305,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1362,17 +1369,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*idx = avail->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1387,17 +1394,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*head = avail->ring[idx & (vq->num - 1)];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1413,17 +1420,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*flags = avail->flags;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1438,15 +1445,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		vhost_vq_access_map_begin(vq);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*event = (__virtio16)avail->ring[vq->num];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1461,17 +1468,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*idx = used->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1486,17 +1493,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 	struct vring_desc *d;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
+		map = vq->maps[VHOST_ADDR_DESC];
 		if (likely(map)) {
 			d = map->addr;
 			*desc = *(d + idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1843,13 +1850,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 {
-	struct vhost_map __rcu *map;
+	struct vhost_map *map;
 	int i;
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[i]);
-		rcu_read_unlock();
+		map = vq->maps[i];
 		if (unlikely(!map))
 			vhost_map_prefetch(vq, i);
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a9a2a93857d2..983d06e62f12 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,10 +115,9 @@ struct vhost_virtqueue {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 	/* Read by memory accessors, modified by meta data
 	 * prefetching, MMU notifier and vring ioctl().
-	 * Synchonrized through mmu_lock (writers) and RCU (writers
-	 * and readers).
+	 * Synchonrized through mmu_lock.
 	 */
-	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
+	struct vhost_map *maps[VHOST_NUM_ADDRS];
 	/* Read by MMU notifier, modified by vring ioctl(),
 	 * synchronized through MMU notifier
 	 * registering/unregistering.
-- 
2.18.1
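
The locking scheme this patch moves to can be sketched in userspace
with a C11 atomic_flag standing in for the kernel spinlock. This is a
model under stated assumptions, not the vhost implementation: struct
model_vq, struct model_map and the model_* functions are hypothetical,
and a single map pointer replaces the maps[] array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified stand-in for struct vhost_map: addr points at ring data. */
struct model_map {
	int *addr;
};

struct model_vq {
	atomic_flag mmu_lock;	/* stand-in for spinlock_t mmu_lock */
	struct model_map *map;	/* stand-in for one vq->maps[] slot */
};

static void model_vq_init(struct model_vq *vq, struct model_map *map)
{
	atomic_flag_clear(&vq->mmu_lock);
	vq->map = map;
}

/* Counterpart of vhost_vq_access_map_begin(): spin until we own the lock. */
static void model_access_begin(struct model_vq *vq)
{
	while (atomic_flag_test_and_set_explicit(&vq->mmu_lock,
						 memory_order_acquire))
		;	/* busy-wait */
}

/* Counterpart of vhost_vq_access_map_end(). */
static void model_access_end(struct model_vq *vq)
{
	atomic_flag_clear_explicit(&vq->mmu_lock, memory_order_release);
}

/* Accessor: 0 on the fast path, -1 tells the caller to fall back. */
static int model_get_idx(struct model_vq *vq, int *idx)
{
	int ret = -1;

	assert(idx != NULL);
	model_access_begin(vq);
	if (vq->map) {
		*idx = *vq->map->addr;
		ret = 0;
	}
	model_access_end(vq);
	return ret;
}

/* Invalidation side: detach the map under the same lock and return it. */
static struct model_map *model_invalidate(struct model_vq *vq)
{
	struct model_map *map;

	model_access_begin(vq);
	map = vq->map;
	vq->map = NULL;
	model_access_end(vq);
	return map;
}
```

Because readers and the invalidation path take the same lock, a reader
either sees the map and finishes before it is detached, or sees NULL and
falls back. No grace period is needed before the detached map is
released, at the cost of a lock acquisition on every metadata access.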


^ permalink raw reply related

* [PATCH V5 8/9] vhost: correctly set dirty pages in MMU notifiers callback
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We need to make sure there's no reference to the map before trying to
mark its pages dirty.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 29e8abe694f7..d8863aaaf0f6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -386,13 +386,12 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	++vq->invalidate_count;
 
 	map = vq->maps[index];
-	if (map) {
+	if (map)
 		vq->maps[index] = NULL;
-		vhost_set_map_dirty(vq, map, index);
-	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
+		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
 }
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 9/9] vhost: do not return -EAGAIN for non blocking invalidation too early
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

Instead of returning -EAGAIN unconditionally, we'd better do that only
when we're sure the range overlaps the metadata area.

Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d8863aaaf0f6..f98155f28f02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,16 +371,19 @@ static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
 	spin_unlock(&vq->mmu_lock);
 }
 
-static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
-				      int index,
-				      unsigned long start,
-				      unsigned long end)
+static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
+				     int index,
+				     unsigned long start,
+				     unsigned long end,
+				     bool blockable)
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
-		return;
+		return 0;
+	else if (!blockable)
+		return -EAGAIN;
 
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
@@ -394,6 +397,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
+
+	return 0;
 }
 
 static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
@@ -414,18 +419,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn,
 {
 	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
 					     mmu_notifier);
-	int i, j;
-
-	if (!mmu_notifier_range_blockable(range))
-		return -EAGAIN;
+	bool blockable = mmu_notifier_range_blockable(range);
+	int i, j, ret;
 
 	for (i = 0; i < dev->nvqs; i++) {
 		struct vhost_virtqueue *vq = dev->vqs[i];
 
-		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			vhost_invalidate_vq_start(vq, j,
-						  range->start,
-						  range->end);
+		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
+			ret = vhost_invalidate_vq_start(vq, j,
+							range->start,
+							range->end, blockable);
+			if (ret)
+				return ret;
+		}
 	}
 
 	return 0;
-- 
2.18.1
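
The decision logic of this patch can be exercised in userspace. The
overlap test below copies the expression from
vhost_map_range_overlap(); everything else (struct model_uaddr and the
model_* functions) is a hypothetical simplification that leaves out
locking and map teardown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for struct vhost_uaddr. */
struct model_uaddr {
	unsigned long uaddr;
	unsigned long size;
};

/* Same expression as vhost_map_range_overlap() in the series. */
static bool model_range_overlap(const struct model_uaddr *u,
				unsigned long start, unsigned long end)
{
	return !(end < u->uaddr || start > u->uaddr - 1 + u->size);
}

/* Per-area check: -EAGAIN only for an overlapping, non-blockable range. */
static int model_invalidate_one(const struct model_uaddr *u,
				unsigned long start, unsigned long end,
				bool blockable)
{
	if (!model_range_overlap(u, start, end))
		return 0;
	if (!blockable)
		return -EAGAIN;
	/* The real code would take mmu_lock and tear down the map here. */
	return 0;
}

/* Walks all metadata areas and stops at the first error. */
static int model_invalidate_all(const struct model_uaddr *areas, int n,
				unsigned long start, unsigned long end,
				bool blockable)
{
	int i, ret;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		ret = model_invalidate_one(&areas[i], start, end, blockable);
		if (ret)
			return ret;
	}
	return 0;
}
```

A non-blockable invalidation that misses every metadata area now
succeeds instead of failing with -EAGAIN up front.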


^ permalink raw reply related

* [PATCH V5 0/9] Fixes for vhost metadata acceleration
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang

Hi all:

This series tries to fix several issues introduced by the metadata
acceleration series. Please review.

Changes from V4:
- switch to using a spinlock to synchronize the MMU notifier with accessors

Changes from V3:
- remove the unnecessary patch

Changes from V2:
- use seqlock helper to synchronize MMU notifier with vhost worker

Changes from V1:
- try not to use RCU to synchronize MMU notifier with vhost worker
- set dirty pages after no readers
- return -EAGAIN only when we find the range is overlapped with
  metadata

Jason Wang (9):
  vhost: don't set uaddr for invalid address
  vhost: validate MMU notifier registration
  vhost: fix vhost map leak
  vhost: reset invalidate_count in vhost_set_vring_num_addr()
  vhost: mark dirty pages during map uninit
  vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
  vhost: do not use RCU to synchronize MMU notifier with worker
  vhost: correctly set dirty pages in MMU notifiers callback
  vhost: do not return -EAGAIN for non blocking invalidation too early

 drivers/vhost/vhost.c | 202 +++++++++++++++++++++++++-----------------
 drivers/vhost/vhost.h |   6 +-
 2 files changed, 122 insertions(+), 86 deletions(-)

-- 
2.18.1


^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-09  6:25 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, David Miller, Jakub Kicinski, Stephen Hemminger,
	David Ahern, dcbw, Michal Kubecek, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <CAJieiUi+gKKc94bKfC-N5LBc=FdzGGo_8+x2oTstihFaUpkKSA@mail.gmail.com>

Fri, Aug 09, 2019 at 06:11:30AM CEST, roopa@cumulusnetworks.com wrote:
>On Fri, Jul 19, 2019 at 4:00 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Add two commands to add and delete alternative ifnames for net device.
>> Each net device can have multiple alternative names.
>>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/netdevice.h      |   4 ++
>>  include/uapi/linux/if.h        |   1 +
>>  include/uapi/linux/if_link.h   |   1 +
>>  include/uapi/linux/rtnetlink.h |   7 +++
>>  net/core/dev.c                 |  58 ++++++++++++++++++-
>>  net/core/rtnetlink.c           | 102 +++++++++++++++++++++++++++++++++
>>  security/selinux/nlmsgtab.c    |   4 +-
>>  7 files changed, 175 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 74f99f127b0e..6922fdb483ca 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -920,10 +920,14 @@ struct tlsdev_ops;
>>
>>  struct netdev_name_node {
>>         struct hlist_node hlist;
>> +       struct list_head list;
>>         struct net_device *dev;
>>         char *name;
>>  };
>>
>> +int netdev_name_node_alt_create(struct net_device *dev, char *name);
>> +int netdev_name_node_alt_destroy(struct net_device *dev, char *name);
>> +
>>  /*
>>   * This structure defines the management hooks for network devices.
>>   * The following hooks can be defined; unless noted otherwise, they are
>> diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
>> index 7fea0fd7d6f5..4bf33344aab1 100644
>> --- a/include/uapi/linux/if.h
>> +++ b/include/uapi/linux/if.h
>> @@ -33,6 +33,7 @@
>>  #define        IFNAMSIZ        16
>>  #endif /* __UAPI_DEF_IF_IFNAMSIZ */
>>  #define        IFALIASZ        256
>> +#define        ALTIFNAMSIZ     128
>>  #include <linux/hdlc/ioctl.h>
>>
>>  /* For glibc compatibility. An empty enum does not compile. */
>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> index 4a8c02cafa9a..92268946e04a 100644
>> --- a/include/uapi/linux/if_link.h
>> +++ b/include/uapi/linux/if_link.h
>> @@ -167,6 +167,7 @@ enum {
>>         IFLA_NEW_IFINDEX,
>>         IFLA_MIN_MTU,
>>         IFLA_MAX_MTU,
>> +       IFLA_ALT_IFNAME_MOD, /* Alternative ifname to add/delete */
>>         __IFLA_MAX
>>  };
>>
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index ce2a623abb75..b36cfd83eb76 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -164,6 +164,13 @@ enum {
>>         RTM_GETNEXTHOP,
>>  #define RTM_GETNEXTHOP RTM_GETNEXTHOP
>>
>> +       RTM_NEWALTIFNAME = 108,
>> +#define RTM_NEWALTIFNAME       RTM_NEWALTIFNAME
>> +       RTM_DELALTIFNAME,
>> +#define RTM_DELALTIFNAME       RTM_DELALTIFNAME
>> +       RTM_GETALTIFNAME,
>> +#define RTM_GETALTIFNAME       RTM_GETALTIFNAME
>> +
>
>I might have missed the prior discussion; why do we need new commands?
>Can't this simply be part of RTM_*LINK, with RTM_SETLINK used to
>set alternate names?

How? This is to add/remove. How do you suggest to add/remove by
setlink?


>
>
>
>>         __RTM_MAX,
>>  #define RTM_MAX                (((__RTM_MAX + 3) & ~3) - 1)
>>  };
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index ad0d42fbdeee..2a3be2b279d3 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -244,7 +244,13 @@ static struct netdev_name_node *netdev_name_node_alloc(struct net_device *dev,
>>  static struct netdev_name_node *
>>  netdev_name_node_head_alloc(struct net_device *dev)
>>  {
>> -       return netdev_name_node_alloc(dev, dev->name);
>> +       struct netdev_name_node *name_node;
>> +
>> +       name_node = netdev_name_node_alloc(dev, dev->name);
>> +       if (!name_node)
>> +               return NULL;
>> +       INIT_LIST_HEAD(&name_node->list);
>> +       return name_node;
>>  }
>>
>>  static void netdev_name_node_free(struct netdev_name_node *name_node)
>> @@ -288,6 +294,55 @@ static struct netdev_name_node *netdev_name_node_lookup_rcu(struct net *net,
>>         return NULL;
>>  }
>>
>> +int netdev_name_node_alt_create(struct net_device *dev, char *name)
>> +{
>> +       struct netdev_name_node *name_node;
>> +       struct net *net = dev_net(dev);
>> +
>> +       name_node = netdev_name_node_lookup(net, name);
>> +       if (name_node)
>> +               return -EEXIST;
>> +       name_node = netdev_name_node_alloc(dev, name);
>> +       if (!name_node)
>> +               return -ENOMEM;
>> +       netdev_name_node_add(net, name_node);
>> +       /* The node that holds dev->name acts as a head of per-device list. */
>> +       list_add_tail(&name_node->list, &dev->name_node->list);
>> +
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL(netdev_name_node_alt_create);
>> +
>> +static void __netdev_name_node_alt_destroy(struct netdev_name_node *name_node)
>> +{
>> +       list_del(&name_node->list);
>> +       netdev_name_node_del(name_node);
>> +       kfree(name_node->name);
>> +       netdev_name_node_free(name_node);
>> +}
>> +
>> +int netdev_name_node_alt_destroy(struct net_device *dev, char *name)
>> +{
>> +       struct netdev_name_node *name_node;
>> +       struct net *net = dev_net(dev);
>> +
>> +       name_node = netdev_name_node_lookup(net, name);
>> +       if (!name_node)
>> +               return -ENOENT;
>> +       __netdev_name_node_alt_destroy(name_node);
>> +
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL(netdev_name_node_alt_destroy);
>> +
>> +static void netdev_name_node_alt_flush(struct net_device *dev)
>> +{
>> +       struct netdev_name_node *name_node, *tmp;
>> +
>> +       list_for_each_entry_safe(name_node, tmp, &dev->name_node->list, list)
>> +               __netdev_name_node_alt_destroy(name_node);
>> +}
>> +
>>  /* Device list insertion */
>>  static void list_netdevice(struct net_device *dev)
>>  {
>> @@ -8258,6 +8313,7 @@ static void rollback_registered_many(struct list_head *head)
>>                 dev_uc_flush(dev);
>>                 dev_mc_flush(dev);
>>
>> +               netdev_name_node_alt_flush(dev);
>>                 netdev_name_node_free(dev->name_node);
>>
>>                 if (dev->netdev_ops->ndo_uninit)
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 1ee6460f8275..7a2010b16e10 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -1750,6 +1750,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>>         [IFLA_CARRIER_DOWN_COUNT] = { .type = NLA_U32 },
>>         [IFLA_MIN_MTU]          = { .type = NLA_U32 },
>>         [IFLA_MAX_MTU]          = { .type = NLA_U32 },
>> +       [IFLA_ALT_IFNAME_MOD]   = { .type = NLA_STRING,
>> +                                   .len = ALTIFNAMSIZ - 1 },
>>  };
>>
>>  static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
>> @@ -3373,6 +3375,103 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>>         return err;
>>  }
>>
>> +static int rtnl_newaltifname(struct sk_buff *skb, struct nlmsghdr *nlh,
>> +                            struct netlink_ext_ack *extack)
>> +{
>> +       struct net *net = sock_net(skb->sk);
>> +       struct nlattr *tb[IFLA_MAX + 1];
>> +       struct net_device *dev;
>> +       struct ifinfomsg *ifm;
>> +       char *new_alt_ifname;
>> +       int err;
>> +
>> +       err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);
>> +       if (err)
>> +               return err;
>> +
>> +       err = rtnl_ensure_unique_netns(tb, extack, true);
>> +       if (err)
>> +               return err;
>> +
>> +       ifm = nlmsg_data(nlh);
>> +       if (ifm->ifi_index > 0) {
>> +               dev = __dev_get_by_index(net, ifm->ifi_index);
>> +       } else if (tb[IFLA_IFNAME]) {
>> +               char ifname[IFNAMSIZ];
>> +
>> +               nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
>> +               dev = __dev_get_by_name(net, ifname);
>> +       } else {
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (!dev)
>> +               return -ENODEV;
>> +
>> +       if (!tb[IFLA_ALT_IFNAME_MOD])
>> +               return -EINVAL;
>> +
>> +       new_alt_ifname = nla_strdup(tb[IFLA_ALT_IFNAME_MOD], GFP_KERNEL);
>> +       if (!new_alt_ifname)
>> +               return -ENOMEM;
>> +
>> +       err = netdev_name_node_alt_create(dev, new_alt_ifname);
>> +       if (err)
>> +               goto out_free_new_alt_ifname;
>> +
>> +       return 0;
>> +
>> +out_free_new_alt_ifname:
>> +       kfree(new_alt_ifname);
>> +       return err;
>> +}
>> +
>> +static int rtnl_delaltifname(struct sk_buff *skb, struct nlmsghdr *nlh,
>> +                            struct netlink_ext_ack *extack)
>> +{
>> +       struct net *net = sock_net(skb->sk);
>> +       struct nlattr *tb[IFLA_MAX + 1];
>> +       struct net_device *dev;
>> +       struct ifinfomsg *ifm;
>> +       char *del_alt_ifname;
>> +       int err;
>> +
>> +       err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);
>> +       if (err)
>> +               return err;
>> +
>> +       err = rtnl_ensure_unique_netns(tb, extack, true);
>> +       if (err)
>> +               return err;
>> +
>> +       ifm = nlmsg_data(nlh);
>> +       if (ifm->ifi_index > 0) {
>> +               dev = __dev_get_by_index(net, ifm->ifi_index);
>> +       } else if (tb[IFLA_IFNAME]) {
>> +               char ifname[IFNAMSIZ];
>> +
>> +               nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
>> +               dev = __dev_get_by_name(net, ifname);
>> +       } else {
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (!dev)
>> +               return -ENODEV;
>> +
>> +       if (!tb[IFLA_ALT_IFNAME_MOD])
>> +               return -EINVAL;
>> +
>> +       del_alt_ifname = nla_strdup(tb[IFLA_ALT_IFNAME_MOD], GFP_KERNEL);
>> +       if (!del_alt_ifname)
>> +               return -ENOMEM;
>> +
>> +       err = netdev_name_node_alt_destroy(dev, del_alt_ifname);
>> +       kfree(del_alt_ifname);
>> +
>> +       return err;
>> +}
>> +
>>  static u16 rtnl_calcit(struct sk_buff *skb, struct nlmsghdr *nlh)
>>  {
>>         struct net *net = sock_net(skb->sk);
>> @@ -5331,6 +5430,9 @@ void __init rtnetlink_init(void)
>>         rtnl_register(PF_UNSPEC, RTM_GETROUTE, NULL, rtnl_dump_all, 0);
>>         rtnl_register(PF_UNSPEC, RTM_GETNETCONF, NULL, rtnl_dump_all, 0);
>>
>> +       rtnl_register(PF_UNSPEC, RTM_NEWALTIFNAME, rtnl_newaltifname, NULL, 0);
>> +       rtnl_register(PF_UNSPEC, RTM_DELALTIFNAME, rtnl_delaltifname, NULL, 0);
>> +
>>         rtnl_register(PF_BRIDGE, RTM_NEWNEIGH, rtnl_fdb_add, NULL, 0);
>>         rtnl_register(PF_BRIDGE, RTM_DELNEIGH, rtnl_fdb_del, NULL, 0);
>>         rtnl_register(PF_BRIDGE, RTM_GETNEIGH, rtnl_fdb_get, rtnl_fdb_dump, 0);
>> diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
>> index 58345ba0528e..a712b54c666c 100644
>> --- a/security/selinux/nlmsgtab.c
>> +++ b/security/selinux/nlmsgtab.c
>> @@ -83,6 +83,8 @@ static const struct nlmsg_perm nlmsg_route_perms[] =
>>         { RTM_NEWNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>         { RTM_DELNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>         { RTM_GETNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>> +       { RTM_NEWALTIFNAME,     NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>> +       { RTM_DELALTIFNAME,     NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>  };
>>
>>  static const struct nlmsg_perm nlmsg_tcpdiag_perms[] =
>> @@ -166,7 +168,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
>>                  * structures at the top of this file with the new mappings
>>                  * before updating the BUILD_BUG_ON() macro!
>>                  */
>> -               BUILD_BUG_ON(RTM_MAX != (RTM_NEWNEXTHOP + 3));
>> +               BUILD_BUG_ON(RTM_MAX != (RTM_NEWALTIFNAME + 3));
>>                 err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
>>                                  sizeof(nlmsg_route_perms));
>>                 break;
>> --
>> 2.21.0
>>

^ permalink raw reply

* [PATCH 3/3] tipc: fix issue of calling smp_processor_id() in preemptible
From: Ying Xue @ 2019-08-09  7:16 UTC (permalink / raw)
  To: davem, netdev; +Cc: jon.maloy, hdanton, tipc-discussion, syzkaller-bugs
In-Reply-To: <1565335017-21302-1-git-send-email-ying.xue@windriver.com>

syzbot found the following issue:

[   81.119772][ T8612] BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor834/8612
[   81.136212][ T8612] caller is dst_cache_get+0x3d/0xb0
[   81.141450][ T8612] CPU: 0 PID: 8612 Comm: syz-executor834 Not tainted 5.2.0-rc6+ #48
[   81.149435][ T8612] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   81.159480][ T8612] Call Trace:
[   81.162789][ T8612]  dump_stack+0x172/0x1f0
[   81.167123][ T8612]  debug_smp_processor_id+0x251/0x280
[   81.172479][ T8612]  dst_cache_get+0x3d/0xb0
[   81.176928][ T8612]  tipc_udp_xmit.isra.0+0xc4/0xb80
[   81.182046][ T8612]  ? kasan_kmalloc+0x9/0x10
[   81.186531][ T8612]  ? tipc_udp_addr2str+0x170/0x170
[   81.191641][ T8612]  ? __copy_skb_header+0x2e8/0x560
[   81.196750][ T8612]  ? __skb_checksum_complete+0x3f0/0x3f0
[   81.202364][ T8612]  ? netdev_alloc_frag+0x1b0/0x1b0
[   81.207452][ T8612]  ? skb_copy_header+0x21/0x2b0
[   81.212282][ T8612]  ? __pskb_copy_fclone+0x516/0xc90
[   81.217470][ T8612]  tipc_udp_send_msg+0x29a/0x4b0
[   81.222400][ T8612]  tipc_bearer_xmit_skb+0x16c/0x360
[   81.227585][ T8612]  tipc_enable_bearer+0xabe/0xd20
[   81.232606][ T8612]  ? __nla_validate_parse+0x2d0/0x1ee0
[   81.238048][ T8612]  ? tipc_bearer_xmit_skb+0x360/0x360
[   81.243401][ T8612]  ? nla_memcpy+0xb0/0xb0
[   81.247710][ T8612]  ? nla_memcpy+0xb0/0xb0
[   81.252020][ T8612]  ? __nla_parse+0x43/0x60
[   81.256417][ T8612]  __tipc_nl_bearer_enable+0x2de/0x3a0
[   81.261856][ T8612]  ? __tipc_nl_bearer_enable+0x2de/0x3a0
[   81.267467][ T8612]  ? tipc_nl_bearer_disable+0x40/0x40
[   81.272848][ T8612]  ? unwind_get_return_address+0x58/0xa0
[   81.278501][ T8612]  ? lock_acquire+0x16f/0x3f0
[   81.283190][ T8612]  tipc_nl_bearer_enable+0x23/0x40
[   81.288300][ T8612]  genl_family_rcv_msg+0x74b/0xf90
[   81.293404][ T8612]  ? genl_unregister_family+0x790/0x790
[   81.298935][ T8612]  ? __lock_acquire+0x54f/0x5490
[   81.303852][ T8612]  ? __netlink_lookup+0x3fa/0x7b0
[   81.308865][ T8612]  genl_rcv_msg+0xca/0x16c
[   81.313266][ T8612]  netlink_rcv_skb+0x177/0x450
[   81.318043][ T8612]  ? genl_family_rcv_msg+0xf90/0xf90
[   81.323311][ T8612]  ? netlink_ack+0xb50/0xb50
[   81.327906][ T8612]  ? lock_acquire+0x16f/0x3f0
[   81.332589][ T8612]  ? kasan_check_write+0x14/0x20
[   81.337511][ T8612]  genl_rcv+0x29/0x40
[   81.341485][ T8612]  netlink_unicast+0x531/0x710
[   81.346268][ T8612]  ? netlink_attachskb+0x770/0x770
[   81.351374][ T8612]  ? _copy_from_iter_full+0x25d/0x8c0
[   81.356765][ T8612]  ? __sanitizer_cov_trace_cmp8+0x18/0x20
[   81.362479][ T8612]  ? __check_object_size+0x3d/0x42f
[   81.367667][ T8612]  netlink_sendmsg+0x8ae/0xd70
[   81.372415][ T8612]  ? netlink_unicast+0x710/0x710
[   81.377520][ T8612]  ? aa_sock_msg_perm.isra.0+0xba/0x170
[   81.383051][ T8612]  ? apparmor_socket_sendmsg+0x2a/0x30
[   81.388530][ T8612]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[   81.394775][ T8612]  ? security_socket_sendmsg+0x8d/0xc0
[   81.400240][ T8612]  ? netlink_unicast+0x710/0x710
[   81.405161][ T8612]  sock_sendmsg+0xd7/0x130
[   81.409561][ T8612]  ___sys_sendmsg+0x803/0x920
[   81.414220][ T8612]  ? copy_msghdr_from_user+0x430/0x430
[   81.419667][ T8612]  ? _raw_spin_unlock_irqrestore+0x6b/0xe0
[   81.425461][ T8612]  ? debug_object_active_state+0x25d/0x380
[   81.431255][ T8612]  ? __lock_acquire+0x54f/0x5490
[   81.436174][ T8612]  ? kasan_check_read+0x11/0x20
[   81.441208][ T8612]  ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[   81.447008][ T8612]  ? mark_held_locks+0xf0/0xf0
[   81.451768][ T8612]  ? __call_rcu.constprop.0+0x28b/0x720
[   81.457298][ T8612]  ? call_rcu+0xb/0x10
[   81.461353][ T8612]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[   81.467589][ T8612]  ? __fget_light+0x1a9/0x230
[   81.472249][ T8612]  ? __fdget+0x1b/0x20
[   81.476301][ T8612]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[   81.482545][ T8612]  __sys_sendmsg+0x105/0x1d0
[   81.487115][ T8612]  ? __ia32_sys_shutdown+0x80/0x80
[   81.492208][ T8612]  ? blkcg_maybe_throttle_current+0x5e2/0xfb0
[   81.498272][ T8612]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   81.503726][ T8612]  ? do_syscall_64+0x26/0x680
[   81.508385][ T8612]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   81.514444][ T8612]  ? do_syscall_64+0x26/0x680
[   81.519110][ T8612]  __x64_sys_sendmsg+0x78/0xb0
[   81.523862][ T8612]  do_syscall_64+0xfd/0x680
[   81.528352][ T8612]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   81.534234][ T8612] RIP: 0033:0x444679
[   81.538114][ T8612] Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b d8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
[   81.557709][ T8612] RSP: 002b:00007fff0201a8b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   81.566147][ T8612] RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000444679
[   81.574108][ T8612] RDX: 0000000000000000 RSI: 0000000020000580 RDI: 0000000000000003
[   81.582152][ T8612] RBP: 00000000006cf018 R08: 0000000000000001 R09: 00000000004002e0
[   81.590113][ T8612] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000402320
[   81.598089][ T8612] R13: 00000000004023b0 R14: 0000000000000000 R15: 0000000000

Commit e9c1a793210f ("tipc: add dst_cache support for udp media")
introduced a call to dst_cache_get() in tipc_udp_xmit(). dst_cache_get()
calls smp_processor_id(), which must not be invoked in preemptible
context, hence the complaint above. Fix this by disabling bottom halves
around the transmit path in tipc_udp_send_msg().
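
For readers unfamiliar with the rule behind this warning, here is a
minimal user-space sketch, not kernel code. It assumes a simplified
model in which a per-CPU slot may only be touched while a "bottom halves
disabled" counter is non-zero; every name prefixed with model_ is
hypothetical and stands in only loosely for local_bh_disable(),
local_bh_enable() and dst_cache_get().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_NR_CPUS 4

static int model_bh_depth;		/* > 0 models "bottom halves disabled" */
static int model_current_cpu;		/* CPU the task is running on right now */
static int model_cache[MODEL_NR_CPUS];	/* one cache slot per CPU */

void model_local_bh_disable(void)
{
	model_bh_depth++;
}

void model_local_bh_enable(void)
{
	assert(model_bh_depth > 0);	/* enable must pair with a disable */
	model_bh_depth--;
}

/* Stand-in for the debug check: a per-CPU slot is only safe while pinned. */
bool model_percpu_access_ok(void)
{
	return model_bh_depth > 0;
}

/* Returns this CPU's slot, or NULL where the kernel would print the BUG. */
int *model_dst_cache_get(void)
{
	if (!model_percpu_access_ok())
		return NULL;
	return &model_cache[model_current_cpu];
}

/* Transmit path shaped like the patched function: pin, use the cache, unpin. */
int model_send(int value)
{
	int *slot;
	int err = 0;

	model_local_bh_disable();
	slot = model_dst_cache_get();
	if (!slot)
		err = -1;
	else
		*slot = value;
	model_local_bh_enable();
	return err;
}
```

The point of the sketch is only the bracketing: the per-CPU lookup
happens strictly between the disable and enable calls, and every exit
path re-enables before returning.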

Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media")
Reported-by: syzbot+1a68504d96cd17b33a05@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/udp_media.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 287df687..ca3ae2e 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -224,6 +224,8 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 	struct udp_bearer *ub;
 	int err = 0;
 
+	local_bh_disable();
+
 	if (skb_headroom(skb) < UDP_MIN_HEADROOM) {
 		err = pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
 		if (err)
@@ -237,9 +239,12 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 		goto out;
 	}
 
-	if (addr->broadcast != TIPC_REPLICAST_SUPPORT)
-		return tipc_udp_xmit(net, skb, ub, src, dst,
-				     &ub->rcast.dst_cache);
+	if (addr->broadcast != TIPC_REPLICAST_SUPPORT) {
+		err = tipc_udp_xmit(net, skb, ub, src, dst,
+				    &ub->rcast.dst_cache);
+		local_bh_enable();
+		return err;
+	}
 
 	/* Replicast, send an skb to each configured IP address */
 	list_for_each_entry_rcu(rcast, &ub->rcast.list, list) {
@@ -259,6 +264,7 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 	err = 0;
 out:
 	kfree_skb(skb);
+	local_bh_enable();
 	return err;
 }
 
-- 
2.7.4


^ permalink raw reply related

* [PATCH 2/3] tipc: fix memory leak issue
From: Ying Xue @ 2019-08-09  7:16 UTC (permalink / raw)
  To: davem, netdev; +Cc: jon.maloy, hdanton, tipc-discussion, syzkaller-bugs
In-Reply-To: <1565335017-21302-1-git-send-email-ying.xue@windriver.com>

syzbot found the following memory leak issue:

[   72.286706][ T7064] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
BUG: memory leak
unreferenced object 0xffff888122bca200 (size 128):
  comm "syz-executor232", pid 7065, jiffies 4294943817 (age 8.880s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 18 a2 bc 22 81 88 ff ff  ..........."....
  backtrace:
    [<000000005bada299>] kmem_cache_alloc_trace+0x145/0x2c0
    [<00000000e7bcdc9f>] tipc_group_create_member+0x3c/0x190
    [<0000000005f56f40>] tipc_group_add_member+0x34/0x40
    [<0000000044406683>] tipc_nametbl_build_group+0x9b/0xf0
    [<000000009f71e803>] tipc_setsockopt+0x170/0x490
    [<000000007f61cbc2>] __sys_setsockopt+0x10f/0x220
    [<00000000cc630372>] __x64_sys_setsockopt+0x26/0x30
    [<00000000ec30be33>] do_syscall_64+0x76/0x1a0
    [<00000000271be3e6>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
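
Judging from the diff below, the leaked object is a member allocated in
tipc_group_create_member() whose (node, port) key was already present in
the tree, so it was never linked in and never freed. A minimal
user-space sketch of the "insert or return the existing entry" pattern
follows; it assumes a plain unbalanced binary tree instead of the kernel
rbtree, and all model_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified member; the real struct has many more fields. */
struct model_member {
	uint32_t node;
	uint32_t port;
	struct model_member *left, *right;
};

uint64_t model_key(const struct model_member *m)
{
	return (uint64_t)m->node << 32 | m->port;
}

/* Insert m, or return the member that already owns the same key. */
struct model_member *model_add_to_tree(struct model_member **root,
				       struct model_member *m)
{
	struct model_member **n = root;
	uint64_t key;

	assert(root && m);
	key = model_key(m);
	while (*n) {
		uint64_t nkey = model_key(*n);

		if (key < nkey)
			n = &(*n)->left;
		else if (key > nkey)
			n = &(*n)->right;
		else
			return *n;	/* duplicate: caller still owns m */
	}
	*n = m;
	return m;
}

/* Allocate a member; on a duplicate key free the new object, reuse the old. */
struct model_member *model_create_member(struct model_member **root,
					 uint32_t node, uint32_t port)
{
	struct model_member *m, *n;

	m = calloc(1, sizeof(*m));
	if (!m)
		return NULL;
	m->node = node;
	m->port = port;
	n = model_add_to_tree(root, m);
	if (n != m)
		free(m);
	return n;
}
```

Returning the winning entry lets the caller tell the two cases apart
with a single pointer comparison and release the unused allocation.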

Reported-by: syzbot+f95d90c454864b3b5bc9@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/group.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/tipc/group.c b/net/tipc/group.c
index 5f98d38..cbc540a 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -273,8 +273,8 @@ static struct tipc_member *tipc_group_find_node(struct tipc_group *grp,
 	return NULL;
 }
 
-static void tipc_group_add_to_tree(struct tipc_group *grp,
-				   struct tipc_member *m)
+struct tipc_member *tipc_group_add_to_tree(struct tipc_group *grp,
+					   struct tipc_member *m)
 {
 	u64 nkey, key = (u64)m->node << 32 | m->port;
 	struct rb_node **n, *parent = NULL;
@@ -282,7 +282,6 @@ static void tipc_group_add_to_tree(struct tipc_group *grp,
 
 	n = &grp->members.rb_node;
 	while (*n) {
-		tmp = container_of(*n, struct tipc_member, tree_node);
 		parent = *n;
 		tmp = container_of(parent, struct tipc_member, tree_node);
 		nkey = (u64)tmp->node << 32 | tmp->port;
@@ -291,17 +290,18 @@ static void tipc_group_add_to_tree(struct tipc_group *grp,
 		else if (key > nkey)
 			n = &(*n)->rb_right;
 		else
-			return;
+			return tmp;
 	}
 	rb_link_node(&m->tree_node, parent, n);
 	rb_insert_color(&m->tree_node, &grp->members);
+	return m;
 }
 
 static struct tipc_member *tipc_group_create_member(struct tipc_group *grp,
 						    u32 node, u32 port,
 						    u32 instance, int state)
 {
-	struct tipc_member *m;
+	struct tipc_member *m, *n;
 
 	m = kzalloc(sizeof(*m), GFP_ATOMIC);
 	if (!m)
@@ -315,10 +315,14 @@ static struct tipc_member *tipc_group_create_member(struct tipc_group *grp,
 	m->instance = instance;
 	m->bc_acked = grp->bc_snd_nxt - 1;
 	grp->member_cnt++;
-	tipc_group_add_to_tree(grp, m);
-	tipc_nlist_add(&grp->dests, m->node);
-	m->state = state;
-	return m;
+	n = tipc_group_add_to_tree(grp, m);
+	if (n == m) {
+		tipc_nlist_add(&grp->dests, m->node);
+		m->state = state;
+	} else {
+		kfree(m);
+	}
+	return n;
 }
 
 void tipc_group_add_member(struct tipc_group *grp, u32 node,
-- 
2.7.4


^ permalink raw reply related

* [PATCH 0/3] Fix three issues found by syzbot
From: Ying Xue @ 2019-08-09  7:16 UTC (permalink / raw)
  To: davem, netdev; +Cc: jon.maloy, hdanton, tipc-discussion, syzkaller-bugs

This series fixes two memory leaks and one case of calling
smp_processor_id() in preemptible context.

Ying Xue (3):
  tipc: fix memory leak issue
  tipc: fix memory leak issue
  tipc: fix issue of calling smp_processor_id() in preemptible

 net/tipc/group.c     | 22 +++++++++++++---------
 net/tipc/node.c      |  7 +++++--
 net/tipc/udp_media.c | 12 +++++++++---
 3 files changed, 27 insertions(+), 14 deletions(-)

-- 
2.7.4


^ permalink raw reply

* [PATCH 1/3] tipc: fix memory leak issue
From: Ying Xue @ 2019-08-09  7:16 UTC (permalink / raw)
  To: davem, netdev; +Cc: jon.maloy, hdanton, tipc-discussion, syzkaller-bugs
In-Reply-To: <1565335017-21302-1-git-send-email-ying.xue@windriver.com>

syzbot found the following memory leak:

[   68.602482][ T7130] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
BUG: memory leak
unreferenced object 0xffff88810df83c00 (size 512):
  comm "softirq", pid 0, jiffies 4294942354 (age 19.830s)
  hex dump (first 32 bytes):
    38 1a 0d 0f 81 88 ff ff 38 1a 0d 0f 81 88 ff ff  8.......8.......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000009375ee42>] kmem_cache_alloc_node+0x153/0x2a0
    [<000000004c563922>] __alloc_skb+0x6e/0x210
    [<00000000ec87bfa1>] tipc_buf_acquire+0x2f/0x80
    [<00000000d151ef84>] tipc_msg_create+0x37/0xe0
    [<000000008bb437b0>] tipc_group_create_event+0xb3/0x1b0
    [<00000000947b1d0f>] tipc_group_proto_rcv+0x569/0x640
    [<00000000b75ab039>] tipc_sk_filter_rcv+0x9ac/0xf20
    [<000000000dab7a6c>] tipc_sk_rcv+0x494/0x8a0
    [<00000000023a7ddd>] tipc_node_xmit+0x196/0x1f0
    [<00000000337dd9eb>] tipc_node_distr_xmit+0x7d/0x120
    [<00000000b6375182>] tipc_group_delete+0xe6/0x130
    [<000000000361ba2b>] tipc_sk_leave+0x57/0xb0
    [<000000009df90505>] tipc_release+0x7b/0x5e0
    [<000000009f3189da>] __sock_release+0x4b/0xe0
    [<00000000d3568ee0>] sock_close+0x1b/0x30
    [<00000000266a6215>] __fput+0xed/0x300

Reported-by: syzbot+78fbe679c8ca8d264a8d@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/node.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 7ca0190..d1852fc 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1469,10 +1469,13 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
 	spin_unlock_bh(&le->lock);
 	tipc_node_read_unlock(n);
 
-	if (unlikely(rc == -ENOBUFS))
+	if (unlikely(rc == -ENOBUFS)) {
 		tipc_node_link_down(n, bearer_id, false);
-	else
+		skb_queue_purge(list);
+		skb_queue_purge(&xmitq);
+	} else {
 		tipc_bearer_xmit(net, bearer_id, &xmitq, &le->maddr);
+	}
 
 	tipc_node_put(n);
 
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Holger Hoffstätte @ 2019-08-09  8:04 UTC (permalink / raw)
  To: Heiner Kallweit, Realtek linux nic maintainers, David Miller
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
In-Reply-To: <f08a3207-0930-4b71-16f1-81e352f87a9c@gmail.com>

On 8/8/19 10:08 PM, Heiner Kallweit wrote:
(..snip..)
>>>
>>> I was about to ask exactly that, whether you have TSO enabled. I don't know what
>>> can trigger the HW issue, it was just confirmed by Realtek that this chip version
>>> has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
>>> linux-next version.
>>
>> So disabling TSO alone didn't work - it leads to reduced throughput (~70 MB/s in iperf).
>> Instead I decided to backport 93681cd7d94f ("r8169: enable HW csum and TSO"), which
>> wasn't easy due to cleanups/renamings of dependencies, but I managed to backport
>> it and .. got the same problem of reduced throughput. wat?!
>>
>> After lots of trial & error I started disabling all offloads and finally found
>> that sg (Scatter-Gather) enabled alone - without TSO - will lead to the throughput
>> drop. So the culprit seems 93681cd7d94f, which disabled TSO on my NIC, but left
>> sg on by default. This was repeatable - switch on sg, throughput drop; turn it
>> off - smooth sailing, now with reduced buffers.
>>
>> I modified the relevant bits to disable tso & sg like this:
>>
>>      /* RTL8168e-vl has a HW issue with TSO */
>>      if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
>> +        dev->vlan_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
>> +        dev->hw_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
>> +        dev->features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
>>      }
>>
>> This seems to work since it restores performance without sg/tso by default
>> and without any additional offloads, yet with xmit_more in the mix.
>> We'll see whether that is stable over the next few days, but I strongly
>> suspect it will be good and that the hiccups were due to xmit_more/TSO
>> interaction.

So that didn't take long - got another timeout this morning during some
random light usage, despite sg/tso being disabled this time.
Again the only common element is the xmit_more patch. :(
Not sure whether you want to revert this right away or wait for 5.4-rc1
feedback. Maybe this too is chipset-specific?

> Thanks a lot for the analysis and testing. Then I'll submit the disabling
> of SG on RTL8168evl (on your behalf), independent of whether it fixes
> the timeout issue.

Got it, thanks!

Holger

^ permalink raw reply

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Eric Dumazet @ 2019-08-09  8:25 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Heiner Kallweit, Realtek linux nic maintainers, David Miller,
	netdev@vger.kernel.org, Sander Eikelenboom
In-Reply-To: <eecaaf82-e6cd-2b75-5756-006a70258a9f@applied-asynchrony.com>

On Fri, Aug 9, 2019 at 10:04 AM Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
>
> On 8/8/19 10:08 PM, Heiner Kallweit wrote:
> (..snip..)
> >>>
> >>> I was about to ask exactly that, whether you have TSO enabled. I don't know what
> >>> can trigger the HW issue, it was just confirmed by Realtek that this chip version
> >>> has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
> >>> linux-next version.
> >>
> >> So disabling TSO alone didn't work - it leads to reduced throughput (~70 MB/s in iperf).
> >> Instead I decided to backport 93681cd7d94f ("r8169: enable HW csum and TSO"), which
> >> wasn't easy due to cleanups/renamings of dependencies, but I managed to backport
> >> it and .. got the same problem of reduced throughput. wat?!
> >>
> >> After lots of trial & error I started disabling all offloads and finally found
> >> that sg (Scatter-Gather) enabled alone - without TSO - will lead to the throughput
> >> drop. So the culprit seems 93681cd7d94f, which disabled TSO on my NIC, but left
> >> sg on by default. This was repeatable - switch on sg, throughput drop; turn it
> >> off - smooth sailing, now with reduced buffers.
> >>
> >> I modified the relevant bits to disable tso & sg like this:
> >>
> >>      /* RTL8168e-vl has a HW issue with TSO */
> >>      if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
> >> +        dev->vlan_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
> >> +        dev->hw_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
> >> +        dev->features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
> >>      }
> >>
> >> This seems to work since it restores performance without sg/tso by default
> >> and without any additional offloads, yet with xmit_more in the mix.
> >> We'll see whether that is stable over the next few days, but I strongly
> >> suspect it will be good and that the hiccups were due to xmit_more/TSO
> >> interaction.
>
> So that didn't take long - got another timeout this morning during some
> random light usage, despite sg/tso being disabled this time.
> Again the only common element is the xmit_more patch. :(
> Not sure whether you want to revert this right away or wait for 5.4-rc1
> feedback. Maybe this too is chipset-specific?
>
> > Thanks a lot for the analysis and testing. Then I'll submit the disabling
> > of SG on RTL8168evl (on your behalf), independent of whether it fixes
> > the timeout issue.
>
> Got it, thanks!
>
> Holger

I would try this fix, maybe?

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index b2a275d8504cf099cff738f2f7554efa9658fe32..e77628813daba493ad50dab9ac1e3703e38b560c 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5691,6 +5691,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
                 */
                smp_wmb();
                netif_stop_queue(dev);
+               door_bell = true;
        }

        if (door_bell)
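
A rough user-space sketch of why the doorbell matters when the queue is
stopped, under the assumption (not confirmed in this thread) that the
timeouts come from descriptors that were queued but never announced to
the NIC. The ring structure and all model_ names are hypothetical; the
real driver computes door_bell and the stop condition differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a TX ring; not the r8169 data structures. */
struct model_ring {
	unsigned int size;	/* total descriptors */
	unsigned int queued;	/* descriptors placed in the ring */
	unsigned int kicked;	/* descriptors the hardware was told about */
	bool stopped;		/* queue stopped, the stack sends nothing more */
};

/* Queue one packet. more == true means the stack promises another packet. */
void model_start_xmit(struct model_ring *r, bool more, bool kick_on_stop)
{
	bool door_bell = !more;

	assert(!r->stopped && r->queued < r->size);
	r->queued++;

	if (r->queued == r->size) {
		r->stopped = true;
		/* No further xmit call will arrive, so nobody else can kick. */
		if (kick_on_stop)
			door_bell = true;
	}

	if (door_bell)
		r->kicked = r->queued;
}

/* Descriptors queued but never announced to the hardware. */
unsigned int model_pending(const struct model_ring *r)
{
	return r->queued - r->kicked;
}
```

With kick_on_stop set to false, a ring that fills up while "more" is
still true ends with pending descriptors and a stopped queue, which in
this model can only be resolved by an outside event such as a watchdog.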

^ permalink raw reply

* Re: memory leak in sctp_get_port_local (2)
From: Xin Long @ 2019-08-09  8:33 UTC (permalink / raw)
  To: syzbot
  Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
	Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <000000000000f93dd2058f9c4873@google.com>

On Thu, Aug 8, 2019 at 11:01 PM syzbot
<syzbot+2d7ecdf99f15689032b3@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    0eb0ce0a Merge tag 'spi-fix-v5.3-rc3' of git://git.kernel...
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1234588c600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=39113f5c48aea971
> dashboard link: https://syzkaller.appspot.com/bug?extid=2d7ecdf99f15689032b3
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=160e1906600000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=140ab906600000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2d7ecdf99f15689032b3@syzkaller.appspotmail.com
>
> executing program
> executing program
> executing program
> executing program
> executing program
> BUG: memory leak
> unreferenced object 0xffff88810fa4b380 (size 64):
>    comm "syz-executor900", pid 7117, jiffies 4294946947 (age 16.560s)
>    hex dump (first 32 bytes):
>      20 4e 00 00 89 e7 4c 8d 00 00 00 00 00 00 00 00   N....L.........
>      58 40 dd 16 82 88 ff ff 00 00 00 00 00 00 00 00  X@..............
>    backtrace:
>      [<00000000f1461735>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
>      [<00000000f1461735>] slab_post_alloc_hook mm/slab.h:522 [inline]
>      [<00000000f1461735>] slab_alloc mm/slab.c:3319 [inline]
>      [<00000000f1461735>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
>      [<00000000ff3ccf22>] sctp_bucket_create net/sctp/socket.c:8374 [inline]
>      [<00000000ff3ccf22>] sctp_get_port_local+0x189/0x5b0 net/sctp/socket.c:8121
>      [<00000000eed41612>] sctp_do_bind+0xcc/0x1e0 net/sctp/socket.c:402
>      [<000000002bf65239>] sctp_bind+0x44/0x70 net/sctp/socket.c:302
>      [<00000000b1aaaf57>] inet_bind+0x40/0xc0 net/ipv4/af_inet.c:441
>      [<00000000db36b917>] __sys_bind+0x11c/0x140 net/socket.c:1647
>      [<00000000679cfe3c>] __do_sys_bind net/socket.c:1658 [inline]
>      [<00000000679cfe3c>] __se_sys_bind net/socket.c:1656 [inline]
>      [<00000000679cfe3c>] __x64_sys_bind+0x1e/0x30 net/socket.c:1656
>      [<000000002aac3ac2>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
>      [<000000000c38e074>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> BUG: memory leak
> unreferenced object 0xffff88810fa4b380 (size 64):
>    comm "syz-executor900", pid 7117, jiffies 4294946947 (age 19.260s)
>    hex dump (first 32 bytes):
>      20 4e 00 00 89 e7 4c 8d 00 00 00 00 00 00 00 00   N....L.........
>      58 40 dd 16 82 88 ff ff 00 00 00 00 00 00 00 00  X@..............
>    backtrace:
>      [<00000000f1461735>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
>      [<00000000f1461735>] slab_post_alloc_hook mm/slab.h:522 [inline]
>      [<00000000f1461735>] slab_alloc mm/slab.c:3319 [inline]
>      [<00000000f1461735>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
>      [<00000000ff3ccf22>] sctp_bucket_create net/sctp/socket.c:8374 [inline]
>      [<00000000ff3ccf22>] sctp_get_port_local+0x189/0x5b0 net/sctp/socket.c:8121
>      [<00000000eed41612>] sctp_do_bind+0xcc/0x1e0 net/sctp/socket.c:402
>      [<000000002bf65239>] sctp_bind+0x44/0x70 net/sctp/socket.c:302
>      [<00000000b1aaaf57>] inet_bind+0x40/0xc0 net/ipv4/af_inet.c:441
>      [<00000000db36b917>] __sys_bind+0x11c/0x140 net/socket.c:1647
>      [<00000000679cfe3c>] __do_sys_bind net/socket.c:1658 [inline]
>      [<00000000679cfe3c>] __se_sys_bind net/socket.c:1656 [inline]
>      [<00000000679cfe3c>] __x64_sys_bind+0x1e/0x30 net/socket.c:1656
>      [<000000002aac3ac2>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
>      [<000000000c38e074>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> BUG: memory leak
> unreferenced object 0xffff88810fa4b380 (size 64):
>    comm "syz-executor900", pid 7117, jiffies 4294946947 (age 21.990s)
>    hex dump (first 32 bytes):
>      20 4e 00 00 89 e7 4c 8d 00 00 00 00 00 00 00 00   N....L.........
>      58 40 dd 16 82 88 ff ff 00 00 00 00 00 00 00 00  X@..............
>    backtrace:
>      [<00000000f1461735>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
>      [<00000000f1461735>] slab_post_alloc_hook mm/slab.h:522 [inline]
>      [<00000000f1461735>] slab_alloc mm/slab.c:3319 [inline]
>      [<00000000f1461735>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
>      [<00000000ff3ccf22>] sctp_bucket_create net/sctp/socket.c:8374 [inline]
>      [<00000000ff3ccf22>] sctp_get_port_local+0x189/0x5b0
> net/sctp/socket.c:8121
>      [<00000000eed41612>] sctp_do_bind+0xcc/0x1e0 net/sctp/socket.c:402
>      [<000000002bf65239>] sctp_bind+0x44/0x70 net/sctp/socket.c:302
>      [<00000000b1aaaf57>] inet_bind+0x40/0xc0 net/ipv4/af_inet.c:441
>      [<00000000db36b917>] __sys_bind+0x11c/0x140 net/socket.c:1647
>      [<00000000679cfe3c>] __do_sys_bind net/socket.c:1658 [inline]
>      [<00000000679cfe3c>] __se_sys_bind net/socket.c:1656 [inline]
>      [<00000000679cfe3c>] __x64_sys_bind+0x1e/0x30 net/socket.c:1656
>      [<000000002aac3ac2>] do_syscall_64+0x76/0x1a0
> arch/x86/entry/common.c:296
>      [<000000000c38e074>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> BUG: memory leak
> unreferenced object 0xffff88810fa4b380 (size 64):
>    comm "syz-executor900", pid 7117, jiffies 4294946947 (age 22.940s)
>    hex dump (first 32 bytes):
>      20 4e 00 00 89 e7 4c 8d 00 00 00 00 00 00 00 00   N....L.........
>      58 40 dd 16 82 88 ff ff 00 00 00 00 00 00 00 00  X@..............
>    backtrace:
>      [<00000000f1461735>] kmemleak_alloc_recursive
> include/linux/kmemleak.h:43 [inline]
>      [<00000000f1461735>] slab_post_alloc_hook mm/slab.h:522 [inline]
>      [<00000000f1461735>] slab_alloc mm/slab.c:3319 [inline]
>      [<00000000f1461735>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
>      [<00000000ff3ccf22>] sctp_bucket_create net/sctp/socket.c:8374 [inline]
>      [<00000000ff3ccf22>] sctp_get_port_local+0x189/0x5b0
> net/sctp/socket.c:8121
>      [<00000000eed41612>] sctp_do_bind+0xcc/0x1e0 net/sctp/socket.c:402
>      [<000000002bf65239>] sctp_bind+0x44/0x70 net/sctp/socket.c:302
>      [<00000000b1aaaf57>] inet_bind+0x40/0xc0 net/ipv4/af_inet.c:441
>      [<00000000db36b917>] __sys_bind+0x11c/0x140 net/socket.c:1647
>      [<00000000679cfe3c>] __do_sys_bind net/socket.c:1658 [inline]
>      [<00000000679cfe3c>] __se_sys_bind net/socket.c:1656 [inline]
>      [<00000000679cfe3c>] __x64_sys_bind+0x1e/0x30 net/socket.c:1656
>      [<000000002aac3ac2>] do_syscall_64+0x76/0x1a0
> arch/x86/entry/common.c:296
>      [<000000000c38e074>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> executing program
> executing program
> executing program
> executing program
This should be fixed by:
commit 9b6c08878e23adb7cc84bdca94d8a944b03f099e
Author: Xin Long <lucien.xin@gmail.com>
Date:   Wed Jun 26 16:31:39 2019 +0800

    sctp: not bind the socket in sctp_connect

Was this commit included in the kernel being tested?

>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Jan Kara @ 2019-08-09  8:34 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Michal Hocko, Jan Kara, John Hubbard, Matthew Wilcox,
	john.hubbard, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dave Hansen, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
	dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
	linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
	linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
	rds-devel, sparclinux, x86, xen-devel
In-Reply-To: <20190808023637.GA1508@iweiny-DESK2.sc.intel.com>

On Wed 07-08-19 19:36:37, Ira Weiny wrote:
> On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote:
> > > So I think your debug option and my suggested renaming serve a bit
> > > different purposes (and thus both make sense). If you do the renaming, you
> > > can just grep to see unconverted sites. Also when someone merges new GUP
> > > user (unaware of the new rules) while you switch GUP to use pins instead of
> > > ordinary references, you'll get compilation error in case of renaming
> > > instead of hard to debug refcount leak without the renaming. And such
> > > conflict is almost bound to happen given the size of GUP patch set... Also
> > > the renaming serves against the "coding inertia" - i.e., GUP is around for
> > > ages so people just use it without checking any documentation or comments.
> > > After switching how GUP works, what used to be correct isn't anymore so
> > > renaming the function serves as a warning that something has really
> > > changed.
> > 
> > Fully agreed!
> 
> OK. Prior to this I've been basing all my work for the RDMA/FS DAX stuff on
> John's put_user_pages()...  (Including when I proposed failing truncate with a
> lease in June [1])
> 
> However, based on the suggestions in that thread it became clear that a new
> interface was going to need to be added to pass in the "RDMA file" information
> to GUP to associate file pins with the correct processes...
> 
> I have many drawings on my white board with "a whole lot of lines" on them to
> make sure that if a process opens a file, mmaps it, pins it with RDMA, _closes_
> it, and unmaps it; that the resulting file pin can still be traced back to the
> RDMA context and all the processes which may have access to it....  No matter
> where the original context may have come from.  I believe I have accomplished
> that.
> 
> Before I go on, I would like to say that the "imbalance" of get_user_pages()
> and put_page() bothers me from a purist standpoint...  However, since this
> discussion cropped up I went ahead and ported my work to Linus' current master
> (5.3-rc3+) and in doing so I only had to steal a bit of John's code...  Sorry
> John...  :-(
> 
> I don't have the commit messages all cleaned up and I know there may be some
> discussion on these new interfaces but I wanted to throw this series out there
> because I think it may be what Jan and Michal are driving at (or at least in
> that direction).
> 
> Right now only RDMA and DAX FS's are supported.  Other users of GUP will still
> fail on a DAX file and regular files will still be at risk.[2]
> 
> I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]:
> 
> https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3
> 
> I think the most relevant patch to this conversation is:
> 
> https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6
> 
> I stole Jan's suggestion for a name as the name I used while prototyping was
> pretty bad...  So thanks, Jan...  ;-)

For your function, I'd choose a name like vaddr_pin_leased_pages() so that
association with a lease is clear from the name :) Also I'd choose the
counterpart to be vaddr_unpin_leased_page[s](). Especially having put_page in
the name looks confusing to me...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] net: mvpp2: implement RXAUI support
From: Antoine Tenart @ 2019-08-09  8:06 UTC (permalink / raw)
  To: Matt Pelland; +Cc: netdev, maxime.chevallier, antoine.tenart
In-Reply-To: <20190808230606.7900-2-mpelland@starry.com>

Hello Matt,

On Thu, Aug 08, 2019 at 07:06:05PM -0400, Matt Pelland wrote:
>  
> +static void mvpp22_gop_init_rxaui(struct mvpp2_port *port)
> +{
> +	struct mvpp2 *priv = port->priv;
> +	void __iomem *xpcs;
> +	u32 val;
> +
> +	xpcs = priv->iface_base + MVPP22_XPCS_BASE(port->gop_id);
> +
> +	val = readl(xpcs + MVPP22_XPCS_CFG0);
> +	val &= ~MVPP22_XPCS_CFG0_RESET_DIS;
> +	writel(val, xpcs + MVPP22_XPCS_CFG0);

The reset logic of the various blocks in PPv2 is handled outside of the
GoP init functions. You should only modify the XPCS configuration here,
without taking care of the reset. See mvpp22_pcs_reset_assert() and
mvpp22_pcs_reset_deassert().

Note that gop_init() is always called with the XPCS reset asserted.
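
To illustrate the split described above, here is a minimal user-space sketch,
not driver code. The register is modelled as a plain variable, and the bit
positions and the helper names xpcs_reset_assert(), xpcs_reset_deassert() and
gop_init_rxaui_cfg() are assumptions made for this example; they only stand in
for the reset helpers and the init function discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for illustration only. */
#define XPCS_CFG0_RESET_DIS   (1u << 0)   /* 1 = block is out of reset */
#define XPCS_CFG0_MODE_MASK   (3u << 3)
#define XPCS_CFG0_MODE_RXAUI  (1u << 3)

/* Stand-in for the memory-mapped XPCS configuration register. */
static uint32_t xpcs_cfg0;

/* Reset handling lives in its own helpers, outside the init path. */
static void xpcs_reset_assert(void)   { xpcs_cfg0 &= ~XPCS_CFG0_RESET_DIS; }
static void xpcs_reset_deassert(void) { xpcs_cfg0 |= XPCS_CFG0_RESET_DIS; }

/* Configuration only: the reset bit is never written here. */
static void gop_init_rxaui_cfg(void)
{
	uint32_t val = xpcs_cfg0;

	/* Precondition: the caller holds the XPCS in reset. */
	assert(!(val & XPCS_CFG0_RESET_DIS));

	val &= ~XPCS_CFG0_MODE_MASK;
	val |= XPCS_CFG0_MODE_RXAUI;
	xpcs_cfg0 = val;
}
```

A caller would run xpcs_reset_assert(), then gop_init_rxaui_cfg(), then
xpcs_reset_deassert(), so the init function stays free of reset logic.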

>  static void mvpp22_gop_init_10gkr(struct mvpp2_port *port)
>  {
>  	struct mvpp2 *priv = port->priv;
> @@ -1065,6 +1089,9 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
>  	case PHY_INTERFACE_MODE_2500BASEX:
>  		mvpp22_gop_init_sgmii(port);
>  		break;
> +	case PHY_INTERFACE_MODE_RXAUI:
> +		mvpp22_gop_init_rxaui(port);
> +		break;

Isn't RXAUI only supported on port #0 (like the 10GKR mode below)?

>  	case PHY_INTERFACE_MODE_10GKR:
>  		if (port->gop_id != 0)
>  			goto invalid_conf;

>  		   MVPP22_XLG_CTRL4_EN_IDLE_CHECK);
>  	ctrl4 |= MVPP22_XLG_CTRL4_FWD_FC | MVPP22_XLG_CTRL4_FWD_PFC;
>  
> +	if (state->interface == PHY_INTERFACE_MODE_RXAUI)
> +		ctrl4 |= MVPP22_XLG_CTRL4_USE_XPCS;

You should probably mask MVPP22_XLG_CTRL4_USE_XPCS when the interface
isn't RXAUI (just to be consistent with what's done in the configuration
functions). You can do this a few lines earlier, where some bits of ctrl4
get masked.
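
A rough user-space sketch of the clear-then-set pattern suggested above. The
bit positions and the helper xlg_ctrl4_update() are hypothetical and exist only
for this example; the real register layout is defined in the driver header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define XLG_CTRL4_FWD_FC    (1u << 5)
#define XLG_CTRL4_FWD_PFC   (1u << 6)
#define XLG_CTRL4_USE_XPCS  (1u << 8)

/* Return the new ctrl4 value for the given interface type. */
static uint32_t xlg_ctrl4_update(uint32_t ctrl4, bool is_rxaui)
{
	/* Clear first, so a stale USE_XPCS never survives a mode change. */
	ctrl4 &= ~XLG_CTRL4_USE_XPCS;
	ctrl4 |= XLG_CTRL4_FWD_FC | XLG_CTRL4_FWD_PFC;

	/* Set the bit again only for RXAUI. */
	if (is_rxaui)
		ctrl4 |= XLG_CTRL4_USE_XPCS;

	return ctrl4;
}
```

With the mask in place, switching from RXAUI to another mode leaves
USE_XPCS cleared without depending on the register's previous contents.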

Thanks!
Antoine

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] net: mvpp2: support multiple comphy lanes
From: Antoine Tenart @ 2019-08-09  8:32 UTC (permalink / raw)
  To: Matt Pelland; +Cc: netdev, maxime.chevallier, antoine.tenart
In-Reply-To: <20190808230606.7900-3-mpelland@starry.com>

Hello Matt,

On Thu, Aug 08, 2019 at 07:06:06PM -0400, Matt Pelland wrote:
>  
>  static void mvpp2_port_enable(struct mvpp2_port *port)
> @@ -3389,7 +3412,9 @@ static void mvpp2_stop_dev(struct mvpp2_port *port)
>  
>  	if (port->phylink)
>  		phylink_stop(port->phylink);
> -	phy_power_off(port->comphy);
> +
> +	if (port->priv->hw_version == MVPP22)
> +		mvpp22_comphy_deinit(port);

You can drop the check on the version here, mvpp22_comphy_deinit will
return 0 if no comphy was described. (You added other calls to this
function without the check, which is fine).

> @@ -5037,20 +5062,18 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>  			    struct fwnode_handle *port_fwnode,
>  			    struct mvpp2 *priv)
>  {
> -	struct phy *comphy = NULL;
> -	struct mvpp2_port *port;
> -	struct mvpp2_port_pcpu *port_pcpu;
> +	unsigned int ntxqs, nrxqs, ncomphys, nrequired_comphys, thread;
>  	struct device_node *port_node = to_of_node(port_fwnode);
> +	struct mvpp2_port_pcpu *port_pcpu;
>  	netdev_features_t features;
> -	struct net_device *dev;
>  	struct phylink *phylink;
> -	char *mac_from = "";
> -	unsigned int ntxqs, nrxqs, thread;
> +	struct mvpp2_port *port;
>  	unsigned long flags = 0;
> +	struct net_device *dev;
> +	int err, i, phy_mode;
> +	char *mac_from = "";
>  	bool has_tx_irqs;
>  	u32 id;
> -	int phy_mode;
> -	int err, i;
>  
>  	has_tx_irqs = mvpp2_port_has_irqs(priv, port_node, &flags);
>  	if (!has_tx_irqs && queue_mode == MVPP2_QDIST_MULTI_MODE) {
> @@ -5084,14 +5107,38 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>  		goto err_free_netdev;
>  	}
>  
> +	port = netdev_priv(dev);
> +
>  	if (port_node) {
> -		comphy = devm_of_phy_get(&pdev->dev, port_node, NULL);
> -		if (IS_ERR(comphy)) {
> -			if (PTR_ERR(comphy) == -EPROBE_DEFER) {
> -				err = -EPROBE_DEFER;
> -				goto err_free_netdev;
> +		for (i = 0, ncomphys = 0; i < ARRAY_SIZE(port->comphys); i++) {
> +			port->comphys[i] = devm_of_phy_get_by_index(&pdev->dev,
> +								    port_node,
> +								    i);
> +			if (IS_ERR(port->comphys[i])) {
> +				err = PTR_ERR(port->comphys[i]);
> +				port->comphys[i] = NULL;
> +				if (err == -EPROBE_DEFER)
> +					goto err_free_netdev;
> +				err = 0;
> +				break;
>  			}
> -			comphy = NULL;
> +
> +			++ncomphys;
> +		}
> +
> +		if (phy_mode == PHY_INTERFACE_MODE_XAUI)
> +			nrequired_comphys = 4;
> +		else if (phy_mode == PHY_INTERFACE_MODE_RXAUI)
> +			nrequired_comphys = 2;
> +		else
> +			nrequired_comphys = 1;
> +
> +		if (ncomphys < nrequired_comphys) {
> +			dev_err(&pdev->dev,
> +				"not enough comphys to support %s\n",
> +				phy_modes(phy_mode));
> +			err = -EINVAL;
> +			goto err_free_netdev;

The comphy is optional and may not be described (some SoCs do not have
a driver for their comphy, and some aren't described at all). In such
cases we rely on the bootloader/firmware configuration. Also, I'm not
sure how that would work with dynamic reconfiguration of the mode if the
number of lanes used changes (I'm not sure that is possible, though).
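
One possible shape for such a check, as a user-space sketch only. The lane
counts (4 for XAUI, 2 for RXAUI, 1 otherwise) are taken from the patch; the
enum and the helpers required_lanes() and check_comphys() are made up for this
example, and whether a partial description should really be an error is an
assumption.

```c
#include <errno.h>

/* Simplified stand-in for the kernel's phy_interface_t. */
enum phy_mode { MODE_OTHER, MODE_RXAUI, MODE_XAUI };

/* Number of comphy lanes each mode needs, as in the patch. */
static unsigned int required_lanes(enum phy_mode mode)
{
	switch (mode) {
	case MODE_XAUI:
		return 4;
	case MODE_RXAUI:
		return 2;
	default:
		return 1;
	}
}

/* Return 0 if the description is acceptable, -EINVAL otherwise. */
static int check_comphys(unsigned int ncomphys, enum phy_mode mode)
{
	/* No comphy described: rely on bootloader/firmware setup. */
	if (ncomphys == 0)
		return 0;

	/* Some lanes described, but fewer than the mode needs. */
	if (ncomphys < required_lanes(mode))
		return -EINVAL;

	return 0;
}
```

This keeps probing working on platforms without any comphy description,
while still rejecting a description that is present but incomplete.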

Thanks!
Antoine

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH v2 net-next 0/2] net: mvpp2: Implement RXAUI Support
From: Antoine Tenart @ 2019-08-09  8:39 UTC (permalink / raw)
  To: Matt Pelland; +Cc: netdev, maxime.chevallier, antoine.tenart
In-Reply-To: <20190808230606.7900-1-mpelland@starry.com>

Hello Matt,

One small comment: it seems you made a typo on davem's email address.
It's .net, not .com :)

Thanks,
Antoine

On Thu, Aug 08, 2019 at 07:06:04PM -0400, Matt Pelland wrote:
> This patch set implements support for configuring Marvell's mvpp2 hardware for
> RXAUI operation. There are two other patches necessary for this to work
> correctly that concern Marvell's cp110 comphy that were emailed to the general
> linux-kernel mailing list earlier on. I can post them here if need be. This
> patch set was successfully tested on both a Marvell Armada 7040 based platform
> as well as an Armada 8040 based platform.
> 
> Changes since v1:
> 
> - Use reverse christmas tree formatting for all modified declaration blocks.
> - Bump MVP22_MAX_COMPHYS to 4 to allow for XAUI operation.
> - Implement comphy sanity checking.
> 
> Matt Pelland (2):
>   net: mvpp2: implement RXAUI support
>   net: mvpp2: support multiple comphy lanes
> 
>  drivers/net/ethernet/marvell/mvpp2/mvpp2.h    |   8 +-
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 129 ++++++++++++++----
>  2 files changed, 110 insertions(+), 27 deletions(-)
> 
> -- 
> 2.21.0
> 

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH net-next 00/10] drop_monitor: Capture dropped packets and metadata
From: Toke Høiland-Jørgensen @ 2019-08-09  8:41 UTC (permalink / raw)
  To: Ido Schimmel, netdev
  Cc: davem, nhorman, jiri, dsahern, roopa, nikolay, jakub.kicinski,
	andy, f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190807103059.15270-1-idosch@idosch.org>

Ido Schimmel <idosch@idosch.org> writes:

> From: Ido Schimmel <idosch@mellanox.com>
>
> So far drop monitor supported only one mode of operation in which a
> summary of recent packet drops is periodically sent to user space as a
> netlink event. The event only includes the drop location (program
> counter) and number of drops in the last interval.
>
> While this mode of operation allows one to understand if the system is
> dropping packets, it is not sufficient if a more detailed analysis is
> required. Both the packet itself and related metadata are missing.
>
> This patchset extends drop monitor with another mode of operation where
> the packet - potentially truncated - and metadata (e.g., drop location,
> timestamp, netdev) are sent to user space as a netlink event. Thanks to
> the extensible nature of netlink, more metadata can be added in the
> future.
>
> To avoid performing expensive operations in the context in which
> kfree_skb() is called, the dropped skbs are cloned and queued on per-CPU
> skb drop list. The list is then processed in process context (using a
> workqueue), where the netlink messages are allocated, prepared and
> finally sent to user space.
>
> A follow-up patchset will integrate drop monitor with devlink and allow
> the latter to call into drop monitor to report hardware drops. In the
> future, XDP drops can be added as well, thereby making drop monitor the
> go-to netlink channel for diagnosing all packet drops.

This is great. Are you planning to add the XDP integration as well? :)

-Toke

^ permalink raw reply
