Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
From: Michael S. Tsirkin @ 2019-09-08 11:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, jgg, aarcange, jglisse,
	linux-mm, James Bottomley, Christoph Hellwig, David Miller,
	linux-arm-kernel, linux-parisc
In-Reply-To: <20190905122736.19768-3-jasowang@redhat.com>

On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> This is a rework on the commit 7f466032dc9e ("vhost: access vq
> metadata through kernel virtual address").
> 
> It was noticed that the copy_to/from_user() friends that was used to
> access virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software checks,
> speculation barriers,

So if we drop speculation barrier,
there's a problem here in access will now be speculated.
This effectively disables the defence in depth effect of
b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
    x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec


So now we need to sprinkle array_index_nospec or barrier_nospec over the
code whenever we use an index we got from userspace.
See below for some examples.


> hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing become more significant.
> 
> This patch tries to eliminate those overheads by accessing them
> through direct mapping of those pages. Invalidation callbacks is
> implemented for co-operation with general VM management (swap, KSM,
> THP or NUMA balancing). We will try to get the direct mapping of vq
> metadata before each round of packet processing if it doesn't
> exist. If we fail, we will simplely fallback to copy_to/from_user()
> friends.
> 
> This invalidation, direct mapping access and set are synchronized
> through spinlock. This takes a step back from the original commit
> 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> address") which tries to RCU which is suspicious and hard to be
> reviewed. This won't perform as well as RCU because of the atomic,
> this could be addressed by the future optimization.
> 
> This method might does not work for high mem page which requires
> temporary mapping so we just fallback to normal
> copy_to/from_user() and may not for arch that has virtual tagged cache
> since extra cache flushing is needed to eliminate the alias. This will
> result complex logic and bad performance. For those archs, this patch
> simply go for copy_to/from_user() friends. This is done by ruling out
> kernel mapping codes through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> 
> Note that this is only done when device IOTLB is not enabled. We
> could use similar method to optimize IOTLB in the future.
> 
> Tests shows at most about 22% improvement on TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on 4.0GHz Kaby Lake.
> 
>         SMAP on | SMAP off
> Before: 4.9Mpps | 6.9Mpps
> After:  6.0Mpps | 7.5Mpps
> 
> On a elder CPU Sandy Bridge without SMAP support. TX PPS doesn't see
> any difference.

Why is not Kaby Lake with SMAP off the same as Sandy Bridge?


> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-parisc@vger.kernel.org
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h |  41 ++++
>  2 files changed, 589 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 791562e03fe0..f98155f28f02 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>  		__vhost_vq_meta_reset(d->vqs[i]);
>  }
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +static void vhost_map_unprefetch(struct vhost_map *map)
> +{
> +	kfree(map->pages);
> +	kfree(map);
> +}
> +
> +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> +				struct vhost_map *map, int index)
> +{
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	int i;
> +
> +	if (uaddr->write) {
> +		for (i = 0; i < map->npages; i++)
> +			set_page_dirty(map->pages[i]);
> +	}
> +}
> +
> +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_map *map[VHOST_NUM_ADDRS];
> +	int i;
> +
> +	spin_lock(&vq->mmu_lock);
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> +		map[i] = vq->maps[i];
> +		if (map[i]) {
> +			vhost_set_map_dirty(vq, map[i], i);
> +			vq->maps[i] = NULL;
> +		}
> +	}
> +	spin_unlock(&vq->mmu_lock);
> +
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
> +	 */
> +
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> +		if (map[i])
> +			vhost_map_unprefetch(map[i]);
> +
> +}
> +
> +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> +{
> +	int i;
> +
> +	vhost_uninit_vq_maps(vq);
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> +		vq->uaddrs[i].size = 0;
> +}
> +
> +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> +				     unsigned long start,
> +				     unsigned long end)
> +{
> +	if (unlikely(!uaddr->size))
> +		return false;
> +
> +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> +}
> +
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	spin_lock(&vq->mmu_lock);
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	spin_unlock(&vq->mmu_lock);
> +}
> +
> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> +				     int index,
> +				     unsigned long start,
> +				     unsigned long end,
> +				     bool blockable)
> +{
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	struct vhost_map *map;
> +
> +	if (!vhost_map_range_overlap(uaddr, start, end))
> +		return 0;
> +	else if (!blockable)
> +		return -EAGAIN;
> +
> +	spin_lock(&vq->mmu_lock);
> +	++vq->invalidate_count;
> +
> +	map = vq->maps[index];
> +	if (map)
> +		vq->maps[index] = NULL;
> +	spin_unlock(&vq->mmu_lock);
> +
> +	if (map) {
> +		vhost_set_map_dirty(vq, map, index);
> +		vhost_map_unprefetch(map);
> +	}
> +
> +	return 0;
> +}
> +
> +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> +				    int index,
> +				    unsigned long start,
> +				    unsigned long end)
> +{
> +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> +		return;
> +
> +	spin_lock(&vq->mmu_lock);
> +	--vq->invalidate_count;
> +	spin_unlock(&vq->mmu_lock);
> +}
> +
> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> +					const struct mmu_notifier_range *range)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +	bool blockable = mmu_notifier_range_blockable(range);
> +	int i, j, ret;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> +			ret = vhost_invalidate_vq_start(vq, j,
> +							range->start,
> +							range->end, blockable);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> +				       const struct mmu_notifier_range *range)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +	int i, j;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> +			vhost_invalidate_vq_end(vq, j,
> +						range->start,
> +						range->end);
> +	}
> +}
> +
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> +	.invalidate_range_start = vhost_invalidate_range_start,
> +	.invalidate_range_end = vhost_invalidate_range_end,
> +};
> +
> +static void vhost_init_maps(struct vhost_dev *dev)
> +{
> +	struct vhost_virtqueue *vq;
> +	int i, j;
> +
> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> +
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		vq = dev->vqs[i];
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> +			vq->maps[j] = NULL;
> +	}
> +}
> +#endif
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->busyloop_timeout = 0;
>  	vq->umem = NULL;
>  	vq->iotlb = NULL;
> +	vq->invalidate_count = 0;
>  	__vhost_vq_meta_reset(vq);
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	vhost_reset_vq_maps(vq);
> +#endif
>  }
>  
>  static int vhost_worker(void *data)
> @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->iov_limit = iov_limit;
>  	dev->weight = weight;
>  	dev->byte_weight = byte_weight;
> +	dev->has_notifier = false;
>  	init_llist_head(&dev->work_list);
>  	init_waitqueue_head(&dev->wait);
>  	INIT_LIST_HEAD(&dev->read_list);
>  	INIT_LIST_HEAD(&dev->pending_list);
>  	spin_lock_init(&dev->iotlb_lock);
> -
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	vhost_init_maps(dev);
> +#endif
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		vq = dev->vqs[i];
> @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->heads = NULL;
>  		vq->dev = dev;
>  		mutex_init(&vq->mutex);
> +		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  	if (err)
>  		goto err_cgroup;
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> +	if (err)
> +		goto err_mmu_notifier;
> +#endif
> +	dev->has_notifier = true;
> +
>  	return 0;
> +
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +err_mmu_notifier:
> +	vhost_dev_free_iovecs(dev);
> +#endif
>  err_cgroup:
>  	kthread_stop(worker);
>  	dev->worker = NULL;
> @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>  	spin_unlock(&dev->iotlb_lock);
>  }
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> +			      int index, unsigned long uaddr,
> +			      size_t size, bool write)
> +{
> +	struct vhost_uaddr *addr = &vq->uaddrs[index];
> +
> +	addr->uaddr = uaddr;
> +	addr->size = size;
> +	addr->write = write;
> +}
> +
> +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> +{
> +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> +			  (unsigned long)vq->desc,
> +			  vhost_get_desc_size(vq, vq->num),
> +			  false);
> +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> +			  (unsigned long)vq->avail,
> +			  vhost_get_avail_size(vq, vq->num),
> +			  false);
> +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> +			  (unsigned long)vq->used,
> +			  vhost_get_used_size(vq, vq->num),
> +			  true);
> +}
> +
> +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> +			       int index)
> +{
> +	struct vhost_map *map;
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	struct page **pages;
> +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> +	int npinned;
> +	void *vaddr, *v;
> +	int err;
> +	int i;
> +
> +	spin_lock(&vq->mmu_lock);
> +
> +	err = -EFAULT;
> +	if (vq->invalidate_count)
> +		goto err;
> +
> +	err = -ENOMEM;
> +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
> +	if (!map)
> +		goto err;
> +
> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> +	if (!pages)
> +		goto err_pages;
> +
> +	err = EFAULT;
> +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> +					uaddr->write, pages);
> +	if (npinned > 0)
> +		release_pages(pages, npinned);
> +	if (npinned != npages)
> +		goto err_gup;
> +
> +	for (i = 0; i < npinned; i++)
> +		if (PageHighMem(pages[i]))
> +			goto err_gup;
> +
> +	vaddr = v = page_address(pages[0]);
> +
> +	/* For simplicity, fallback to userspace address if VA is not
> +	 * contigious.
> +	 */
> +	for (i = 1; i < npinned; i++) {
> +		v += PAGE_SIZE;
> +		if (v != page_address(pages[i]))
> +			goto err_gup;
> +	}
> +
> +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> +	map->npages = npages;
> +	map->pages = pages;
> +
> +	vq->maps[index] = map;
> +	/* No need for a synchronize_rcu(). This function should be
> +	 * called by dev->worker so we are serialized with all
> +	 * readers.
> +	 */
> +	spin_unlock(&vq->mmu_lock);
> +
> +	return 0;
> +
> +err_gup:
> +	kfree(pages);
> +err_pages:
> +	kfree(map);
> +err:
> +	spin_unlock(&vq->mmu_lock);
> +	return err;
> +}
> +#endif
> +
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  		kthread_stop(dev->worker);
>  		dev->worker = NULL;
>  	}
> -	if (dev->mm)
> +	if (dev->mm) {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +		if (dev->has_notifier) {
> +			mmu_notifier_unregister(&dev->mmu_notifier,
> +						dev->mm);
> +			dev->has_notifier = false;
> +		}
> +#endif
>  		mmput(dev->mm);
> +	}
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_uninit_vq_maps(dev->vqs[i]);
> +#endif
>  	dev->mm = NULL;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			*((__virtio16 *)&used->ring[vq->num]) =
> +				cpu_to_vhost16(vq, vq->avail_idx);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>  			      vhost_avail_event(vq));
>  }
> @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  				 struct vring_used_elem *head, int idx,
>  				 int count)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +	size_t size;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			size = count * sizeof(*head);
> +			memcpy(used->ring + idx, head, size);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>  				  count * sizeof(*head));
>  }
> @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>  			      &vq->used->flags);
>  }
> @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>  			      &vq->used->idx);
>  }
> @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  				      __virtio16 *idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*idx = avail->idx;

index can now be speculated.

> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  				       __virtio16 *head, int idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*head = avail->ring[idx & (vq->num - 1)];


Since idx can be speculated, I guess we need array_index_nospec here?


> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *head,
>  			       &vq->avail->ring[idx & (vq->num - 1)]);
>  }
> @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  					__virtio16 *flags)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*flags = avail->flags;
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>  }
>  
>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  				       __virtio16 *event)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*event = (__virtio16)avail->ring[vq->num];
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>  }
>  
>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  				     __virtio16 *idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			*idx = used->idx;
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>  }


This seems to be used during init. Why do we bother
accelerating this?


>  
>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  				 struct vring_desc *desc, int idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_desc *d;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_DESC];
> +		if (likely(map)) {
> +			d = map->addr;
> +			*desc = *(d + idx);


Since idx can be speculated, I guess we need array_index_nospec here?


> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>  }
>  

I also wonder about the userspace address we get eventualy.
It would seem that we need to prevent that from speculating -
and that seems like a good idea even if this patch isn't
applied. As you are playing with micro-benchmarks, maybe
you could the below patch?
It's unfortunately untested.
Thanks a lot in advance!

===>
vhost: block speculation of translated descriptors

iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.

Userspace address are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.

Following the defence in depth principle, make sure
the address is not validated out of node range.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5dc174ac8cac..863e25011ef6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(node->userspace_addr + addr - node->start);
+			(node->userspace_addr +
+			 array_index_nospec(addr - node->start,
+					    node->size));
 		s += size;
 		addr += size;
 		++ret;

^ permalink raw reply related

* [RFC PATCH untested] vhost: block speculation of translated descriptors
From: Michael S. Tsirkin @ 2019-09-08 11:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, kvm, virtualization, netdev

iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.

Userspace address are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.

Following the defence in depth principle, make sure
the address is not validated out of node range.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5dc174ac8cac..0ee375fb7145 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2072,7 +2072,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(node->userspace_addr + addr - node->start);
+			(node->userspace_addr +
+			 array_index_nospec(addr - node->start,
+					    node->size));
 		s += size;
 		addr += size;
 		++ret;
-- 
MST

^ permalink raw reply related

* Re: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Vladimir Oltean @ 2019-09-08 11:07 UTC (permalink / raw)
  To: Andrew Lunn, David Miller
  Cc: f.fainelli, vivien.didelot, vinicius.gomes, vedang.patel,
	richardcochran, weifeng.voon, jiri, m-karicheri2, Jose.Abreu,
	ilias.apalodimas, jhs, xiyou.wangcong, kurt.kanzenbach, netdev
In-Reply-To: <20190907144548.GA21922@lunn.ch>

Hi Andrew, David,

On Sep 7, 2019, at 3:46 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Sep 06, 2019 at 02:54:03PM +0200, David Miller wrote:
>>
>>  From: Vladimir Oltean <olteanv@gmail.com>
>>  Date: Mon,  2 Sep 2019 19:25:29 +0300
>>
>>>
>>>  This is the first attempt to submit the tc-taprio offload model for
>>>  inclusion in the net tree.
>>
>>
>>  Someone really needs to review this.
>
> Hi Vladimir
>
> You might have more chance getting this reviewed if you split it up
> into a number of smaller series. Richard could probably review the
> plain PTP changes. Who else has worked on tc-taprio recently? A series
> purely about tc-taprio might be more likely reviewed by a tc-taprio
> person, if it does not contain PTP changes.
>
>     Andrew

I think Richard has been there when the taprio, etf qdiscs, SO_TXTIME
were first defined and developed:
https://patchwork.ozlabs.org/cover/808504/
I expect he is capable of delivering a competent review of the entire
series, possibly way more competent than my patch set itself.

The reason why I'm not splitting it up is because I lose around 10 ns
of synchronization offset when using the hardware-corrected PTPCLKVAL
clock for timestamping rather than the PTPTSCLK free-running counter.
This is mostly due to the fact that SPI interaction is reduced to a
minimum when correcting the switch's PHC in software - OTOH when that
correction translates into SPI writes to PTPCLKADD/PTPCLKVAL and
PTPCLKRATE, that's when things go a bit downhill with the precision.
Now the compromise is fully acceptable if the PTP clock is to be used
as the trigger source for the time-aware scheduler, but the conversion
would be quite pointless with no user to really require the hardware
clock.

Additionally, the 802.1AS PTP profile even calls for switches and
end-stations to use timestamping counters that are free-running, and
scale&rate-correct those in software - due to a perceived "double
feedback loop", or "changing the ruler while measuring with it". Now
I'm no expert at all, but it would be interesting if we went on with
the discussion in the direction of what Linux is currently
understanding by a "free-running" PTP counter. On one hand there's the
timecounter/cyclecounter in the kernel which makes for a
software-corrected PHC, and on the other there's the free_running
option in linuxptp which makes for a "nowhere-corrected" PHC that is
only being used in the E2E_TC and P2P_TC profiles. But user space
otherwise has no insight into the PHC implementation from the kernel,
and "free_running" from ptp4l can't really be used to implement the
synchronization mechanism required by 802.1AS.

To me, the most striking aspect is that this particular recommendation
from 802.1AS is at direct odds with 802.1Qbv (time-based egress) /
802.1Qci (time-based ingress policing) which clearly require a PTP
counter in the NIC that ticks to the wall clock, and not to a random
free-running time since boot up. I simply can't seem to reconcile the
two.
What this particular switch does is that it permits RX and TX
timestamps to be taken in either corrected or uncorrected timebases
(but unfortunately not both at the same time). I think the hardware
designers' idea was to take timestamps off the uncorrected clock
(PTPTSCLK) and then do a sort of phc2sys-to-itself: write the
software-corrected value of the timecounter/cyclecounter into the
PTPCLKVAL hardware registers which get used for Qbv/Qci.
Actually I hate to use those terms when talking about SJA1105 hardware
support, since it's more "in the style of" IEEE rather than strict
compliance (timing of the design vs the standard might have played a
role as well).

But let's leave 802.1AS aside for a second - that's not what the patch
set is about, but rather a bit of background on why there are 2 PTP
clocks in this switch, and why I'm switching from one to the other.
Richard didn't really warm up to the phc2sys-to-itself idea in the
past, and opted for simplicity: just use the hardware-corrected
PTPCLKVAL for everything, which is exactly what I'm doing as of now.

The only people whom I know are working on TSN stuff are mostly
entrenched in papers, standards and generally in the hardware-only
mentality. There is obviously a lot to be done for Linux to be a
proper TSN endpoint, and RT is a big one. For a switch in particular,
things are a bit easier due to the fact that it just needs to ensure
the real-time guarantees of a frame that was supposedly already
delivered in-band with the schedule. And there's no other way to do
that rather than through a hardware offload - otherwise the software
tc-taprio would only shape the frames egressed by the management CPU
of the switch. The tc-taprio offload for a switch only makes sense
when taken together with the bridging offload, if you will.

I "dared" to submit this for merging maybe because I don't see the
subtleties that prevent it from going in, at least for a switch - it
just works and does the job. I would have loved to see this in 5.4
just so I would have to lug around a bit less patches when finally
starting to evaluate the endpoint side of things with the 5.4-rt
patch. But nonetheless, there's no hurry and getting a healthy
discussion going is surely more important than the patches themselves
are. On the other hand there needs to be a balance, and just talking
with no code is no good either - fixes, improvements, rework can
always come later once we commit to the basic offload model.

I happen to be around at Plumbers during the following days to learn
what else is going on in the Linux community, and develop a more
complete mental model for myself for how TSN fits in with all of that.
If anybody happens to also be around, I'd be more than happy to talk.

Regards,
-Vladimir

^ permalink raw reply

* Re: [patch net-next v2 3/3] net: devlink: move reload fail indication to devlink core and expose to user
From: Ido Schimmel @ 2019-09-08 11:25 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev@vger.kernel.org, davem@davemloft.net, dsahern@gmail.com,
	jakub.kicinski@netronome.com, Tariq Toukan, mlxsw
In-Reply-To: <20190907205400.14589-4-jiri@resnulli.us>

On Sat, Sep 07, 2019 at 10:54:00PM +0200, Jiri Pirko wrote:
> +bool devlink_is_reload_failed(struct devlink *devlink)

Forgot to mention that this can be 'const'

> +{
> +	return devlink->reload_failed;
> +}
> +EXPORT_SYMBOL_GPL(devlink_is_reload_failed);

^ permalink raw reply

* Re: [PATCH bpf-next] xdp: Fix race in dev_map_hash_update_elem() when replacing element
From: Jesper Dangaard Brouer @ 2019-09-08 11:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: brouer, ast, bpf, daniel, davem, hawk, jakub.kicinski,
	john.fastabend, kafai, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs, syzbot+4e7a85b1432052e8d6f8
In-Reply-To: <20190908082016.17214-1-toke@redhat.com>

On Sun,  8 Sep 2019 09:20:16 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> syzbot found a crash in dev_map_hash_update_elem(), when replacing an
> element with a new one. Jesper correctly identified the cause of the crash
> as a race condition between the initial lookup in the map (which is done
> before taking the lock), and the removal of the old element.
> 
> Rather than just add a second lookup into the hashmap after taking the
> lock, fix this by reworking the function logic to take the lock before the
> initial lookup.
> 
> Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
> Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: general protection fault in cbs_destroy
From: syzbot @ 2019-09-08 12:33 UTC (permalink / raw)
  To: davem, hdanton, jhs, jiri, leandro.maciel.dorileo, linux-kernel,
	netdev, syzkaller-bugs, vedang.patel, xiyou.wangcong
In-Reply-To: <000000000000d2a5c60592047e58@google.com>

syzbot has bisected this bug to:

commit e0a7683d30e91e30ee6cf96314ae58a0314a095e
Author: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Date:   Mon Apr 8 17:12:18 2019 +0000

     net/sched: cbs: fix port_rate miscalculation

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=158f3c01600000
start commit:   3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=178f3c01600000
console output: https://syzkaller.appspot.com/x/log.txt?x=138f3c01600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
dashboard link: https://syzkaller.appspot.com/bug?extid=3a8d6a998cbb73bcf337
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17998f9e600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10421efa600000

Reported-by: syzbot+3a8d6a998cbb73bcf337@syzkaller.appspotmail.com
Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Default qdisc not correctly initialized with custom MTU
From: Holger Hoffstätte @ 2019-09-08 14:13 UTC (permalink / raw)
  To: Netdev

I just installed a better NIC (Aquantia 2.5/5/10Gb, apparently with
multiple queues) and now get the "mq" pseudo-qdisc automatically installed -
so far, so good. I also configure fq_codel as default qdisc via sysctls
and a larger MTU of 9000 for the device. This somehow leads to some
slight confusion about initialization order between the qdiscs and the
device.

Right after booting, where sysctl runs before eth0 setup:

$tc qd show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc fq_codel 0: dev eth0 parent :8 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :7 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :6 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :5 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :4 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :3 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :2 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 ...

Note that fq_codel thinks the quantum (derived from the MTU) is still 1500;
it just used the default setting as there was no link yet.

Howwver, the MTU is set to 9000:

$ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000

It seems the default qdisc is created before the link is set up, and then
just attached to mq without consideration for the actual link configuration.
Simply kicking the whole thing to replace mq with itself again does the trick:

$tc qd replace root dev eth0 mq
$tc qd show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 8001: dev eth0 root
qdisc fq_codel 0: dev eth0 parent 8001:8 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:7 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:6 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:5 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:4 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:3 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:2 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:1 limit 10240p flows 1024 quantum 9014 ...

Now the quanta are in line with the actual MTU.

I can't help but feel this is a slight bug in terms of initialization order,
and that the default qdisc should only be created when it's first being
used/attached to a link, not when the sysctls are configured.
Kernel is 5.2.x and I didn't see anything in 5.3 or net-next to address
this yet.

Thoughts?

Holger

^ permalink raw reply

* Re: general protection fault in qdisc_put
From: Linus Torvalds @ 2019-09-08 17:18 UTC (permalink / raw)
  To: syzbot
  Cc: akinobu.mita, Andrew Morton, David Miller, Dmitry Vyukov, jhs,
	jiri, Linux List Kernel Mailing, Michal Hocko, Netdev,
	syzkaller-bugs, Cong Wang
In-Reply-To: <000000000000df42500592047e0a@google.com>

On Sat, Sep 7, 2019 at 11:08 PM syzbot
<syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com> wrote:
>
> The bug was bisected to:
>
> commit e41d58185f1444368873d4d7422f7664a68be61d
> Author: Dmitry Vyukov <dvyukov@google.com>
> Date:   Wed Jul 12 21:34:35 2017 +0000
>
>      fault-inject: support systematic fault injection

That commit does seem a bit questionable, but not the cause of this
problem (just the trigger).

I think the questionable part is that the new code doesn't honor the
task filtering, and will fail even for protected tasks. Dmitry?

> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 9699 Comm: syz-executor169 Not tainted 5.3.0-rc7+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:qdisc_put+0x25/0x90 net/sched/sch_generic.c:983

Yes, looks like 'qdisc' is NULL.

This is the

        qdisc_put(q->qdisc);

in sfb_destroy(), called from qdisc_create().

I think what is happening is this (in qdisc_create()):

        if (ops->init) {
                err = ops->init(sch, tca[TCA_OPTIONS], extack);
                if (err != 0)
                        goto err_out5;
        }
        ...
err_out5:
        /* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
        if (ops->destroy)
                ops->destroy(sch);

and "ops->init" is sfb_init(), which will not initialize q->qdisc if
tcf_block_get() fails.

I see two solutions:

 (a) move the

        q->qdisc = &noop_qdisc;

     up earlier in sfb_init(), so that qdisc is always initialized
after sfb_init(), even on failure.

 (b) just make qdisc_put(NULL) just silently work as a no-op.

 (c) change all the semantics to not call ->destroy if ->init failed.

Honestly, (a) seems very fragile - do all the other init routines do
this? And (c) sounds like a big change, and very fragile too.

So I'd suggest that qdisc_put() be made to just ignore a NULL pointer
(and maybe an error pointer too?).

But I'll leave it to the maintainers to sort out the proper fix.
Maybe people prefer (a)?

                   Linus

^ permalink raw reply

* [Patch net] net_sched: check cops->tcf_block in tc_bind_tclass()
From: Cong Wang @ 2019-09-08 19:11 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, syzbot+21b29db13c065852f64b, Jamal Hadi Salim,
	Jiri Pirko

At least sch_red and sch_tbf don't implement ->tcf_block()
while still have a non-zero tc "class".

Instead of adding nop implementations to each of such qdisc's,
we can just relax the check of cops->tcf_block() in
tc_bind_tclass(). They don't support TC filter anyway.

Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_api.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 04faee7ccbce..1047825d9f48 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1920,6 +1920,8 @@ static void tc_bind_tclass(struct Qdisc *q, u32 portid, u32 clid,
 	cl = cops->find(q, portid);
 	if (!cl)
 		return;
+	if (!cops->tcf_block)
+		return;
 	block = cops->tcf_block(q, cl, NULL);
 	if (!block)
 		return;
-- 
2.21.0


^ permalink raw reply related

* [Patch net] sch_hhf: ensure quantum and hhf_non_hh_weight are non-zero
From: Cong Wang @ 2019-09-08 20:40 UTC (permalink / raw)
  To: netdev
  Cc: Cong Wang, syzbot+bc6297c11f19ee807dc2,
	syzbot+041483004a7f45f1f20a, syzbot+55be5f513bed37fc4367,
	Jamal Hadi Salim, Jiri Pirko, Terry Lam

In case of TCA_HHF_NON_HH_WEIGHT or TCA_HHF_QUANTUM is zero,
it would make no progress inside the loop in hhf_dequeue() thus
kernel would get stuck.

Fix this by checking this corner case in hhf_change().

Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com
Reported-by: syzbot+041483004a7f45f1f20a@syzkaller.appspotmail.com
Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Terry Lam <vtlam@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_hhf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index cee6971c1c82..23cd1c873a2c 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -531,7 +531,7 @@ static int hhf_change(struct Qdisc *sch, struct nlattr *opt,
 		new_hhf_non_hh_weight = nla_get_u32(tb[TCA_HHF_NON_HH_WEIGHT]);
 
 	non_hh_quantum = (u64)new_quantum * new_hhf_non_hh_weight;
-	if (non_hh_quantum > INT_MAX)
+	if (non_hh_quantum == 0 || non_hh_quantum > INT_MAX)
 		return -EINVAL;
 
 	sch_tree_lock(sch);
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Andrew Lunn @ 2019-09-08 20:42 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: David Miller, f.fainelli, vivien.didelot, vinicius.gomes,
	vedang.patel, richardcochran, weifeng.voon, jiri, m-karicheri2,
	Jose.Abreu, ilias.apalodimas, jhs, xiyou.wangcong,
	kurt.kanzenbach, netdev
In-Reply-To: <CA+h21hqLF1gE+aDH9xQPadCuo6ih=xWY73JZvg7c58C1tC+0Jg@mail.gmail.com>

On Sun, Sep 08, 2019 at 12:07:27PM +0100, Vladimir Oltean wrote:
> I think Richard has been there when the taprio, etf qdiscs, SO_TXTIME
> were first defined and developed:
> https://patchwork.ozlabs.org/cover/808504/
> I expect he is capable of delivering a competent review of the entire
> series, possibly way more competent than my patch set itself.
> 
> The reason why I'm not splitting it up is because I lose around 10 ns
> of synchronization offset when using the hardware-corrected PTPCLKVAL
> clock for timestamping rather than the PTPTSCLK free-running counter.

Hi Vladimir

I'm not suggesting anything is wrong with your concept, when i say
split it up. It is more than when somebody sees 15 patches, they
decide they don't have the time at the moment, and put it off until
later. And often later never happens. If however they see a smaller
number of patches, they think that yes they have time now, and do the
review.

So if you are struggling to get something reviewed, make it more
appealing for the reviewer. Salami tactics.

    Andrew

^ permalink raw reply

* Re: [RFC bpf-next 1/7] bpf: add bpf_pcap() helper to simplify packet capture
From: Yonghong Song @ 2019-09-08 22:02 UTC (permalink / raw)
  To: Alan Maguire, ast@kernel.org, daniel@iogearbox.net, Martin Lau,
	Song Liu, davem@davemloft.net, jakub.kicinski@netronome.com,
	hawk@kernel.org, john.fastabend@gmail.com, rostedt@goodmis.org,
	mingo@redhat.com, quentin.monnet@netronome.com, Andrey Ignatov,
	joe@wand.net.nz, acme@redhat.com, jolsa@kernel.org,
	alexey.budankov@linux.intel.com, gregkh@linuxfoundation.org,
	namhyung@kernel.org, sdf@google.com, f.fainelli@gmail.com,
	shuah@kernel.org, peter@lekensteyn.nl, ivan@cloudflare.com,
	Andrii Nakryiko, bhole_prashant_q7@lab.ntt.co.jp,
	david.calavera@gmail.com, danieltimlee@gmail.com,
	Takshak Chahande, netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kselftest@vger.kernel.org
In-Reply-To: <1567892444-16344-2-git-send-email-alan.maguire@oracle.com>



On 9/7/19 2:40 PM, Alan Maguire wrote:
> bpf_pcap() simplifies packet capture for skb and XDP
> BPF programs by creating a BPF perf event containing information
> relevant for packet capture (protocol, actual/captured packet
> size, time of capture, etc) along with the packet payload itself.
> All of this is stored in a "struct bpf_pcap_hdr".
> 
> This header information can then be retrieved from the perf
> event map and used by packet capture frameworks such as libpcap
> to carry out packet capture.
> 
> skb and XDP programs currently deal in Ethernet-based traffic
> exclusively, so should specify BPF_PCAP_TYPE_ETH or
> BPF_PCAP_TYPE_UNSET.  The protocol parameter will be used
> in a later commit.
> 
> Note that libpcap assumes times are relative to the epoch while
> we record nanoseconds since boot; as a result any times need
> to be normalized with respect to the boot time for libpcap
> storage; sysinfo(2) can be used to retrieve boot time to normalize
> values appropriately.
> 
> Example usage for a tc-bpf program:
> 
> struct bpf_map_def SEC("maps") pcap_map = {
> 	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> 	.key_size = sizeof(int),
> 	.value_size = sizeof(int),
> 	.max_entries = 1024,
> };
> 
> SEC("cap")
> int cap(struct __sk_buff *skb)
> {
> 	bpf_pcap(skb, 1514, &pcap_map, BPF_PCAP_TYPE_ETH, 0);
> 
> 	return TC_ACT_OK;
> }
> 
> Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
> ---
>   include/linux/bpf.h      | 20 +++++++++++++
>   include/uapi/linux/bpf.h | 75 +++++++++++++++++++++++++++++++++++++++++++++++-
>   kernel/bpf/verifier.c    |  4 ++-
>   net/core/filter.c        | 67 ++++++++++++++++++++++++++++++++++++++++++
>   4 files changed, 164 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5b9d223..033c9cf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1145,4 +1145,24 @@ static inline u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
>   }
>   #endif /* CONFIG_INET */
>   
> +
> +static inline int bpf_pcap_prepare(int protocol, u32 cap_len, u32 tot_len,
> +				   u64 flags, struct bpf_pcap_hdr *pcap)
> +{
> +	if (protocol < 0 || pcap == NULL)
> +		return -EINVAL;
> +
> +	pcap->magic = BPF_PCAP_MAGIC;
> +	pcap->protocol = protocol;
> +	pcap->flags = flags;
> +
> +	if (cap_len == 0 || tot_len < cap_len)
> +		cap_len = tot_len;
> +	pcap->cap_len = cap_len;
> +	pcap->tot_len = tot_len;
> +	pcap->ktime_ns = ktime_get_mono_fast_ns();
> +
> +	return 0;
> +}
> +
>   #endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 77c6be9..a27e58e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2750,6 +2750,39 @@ struct bpf_stack_build_id {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_pcap(void *data, u32 size, struct bpf_map *map, int protocol,
> + *		u64 flags)
> + *	Description
> + *		Write packet data from *data* into a special BPF perf event
> + *              held by *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This
> + *		perf event has the same attributes as perf events generated
> + *		by bpf_perf_event_output.  For skb and xdp programs, *data*
> + *		is the relevant context.
> + *
> + *		Metadata for this event is a **struct bpf_pcap_hdr**; this
> + *		contains the capture length, actual packet length and
> + *		the starting protocol.
> + *
> + *		The max number of bytes of context to store is specified via
> + *		*size*.
> + *
> + *		The flags value can be used to specify an id value of up
> + *		to 48 bits; the id can be used to correlate captured packets
> + *		with other trace data, since the passed-in flags value is stored
> + *		stored in the **struct bpf_pcap_hdr** in the **flags** field.
> + *
> + *		The *protocol* value specifies the protocol type of the start
> + *		of the packet so that packet capture can carry out
> + *		interpretation.  See **pcap-linktype** (7) for details on
> + *		the supported values.
> + *
> + *	Return
> + *		0 on success, or a negative error in case of failure.
> + *		-ENOENT will be returned if the associated perf event
> + *		map entry is empty, or the skb is zero-length.
> + *		-EINVAL will be returned if the flags value is invalid.

The feature itself indeed seems useful for networking community.
I just have some questions what kind of kernel support is needed.

We already have perf_event_output for skb and xdp.
Can we just use the original helpers and do a little bit post
processing in user space to get what you want?
It looks possible to me to generate bpf_pcap_hdr in user space
before sending output to pcap analyzer.

> + *
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2862,7 +2895,8 @@ struct bpf_stack_build_id {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(pcap),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -2941,6 +2975,9 @@ enum bpf_func_id {
>   /* BPF_FUNC_sk_storage_get flags */
>   #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
>   
> +/* BPF_FUNC_pcap flags */
> +#define	BPF_F_PCAP_ID_MASK		0xffffffffffff
> +
>   /* Mode for BPF_FUNC_skb_adjust_room helper. */
>   enum bpf_adj_room_mode {
>   	BPF_ADJ_ROOM_NET,
> @@ -3613,4 +3650,40 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +/* bpf_pcap_hdr contains information related to a particular packet capture
> + * flow.  It specifies
> + *
> + * - a magic number BPF_PCAP_MAGIC which identifies the perf event as
> + *   a pcap-related event.
> + * - a starting protocol is the protocol associated with the header
> + * - a flags value, copied from the flags value passed into bpf_pcap().
> + *   IDs can be used to correlate packet capture data and other tracing data.
> + *
> + * bpf_pcap_hdr also contains the information relating to the to-be-captured
> + * packet, and closely corresponds to the struct pcap_pkthdr used by
> + * pcap_dump (3PCAP).  The bpf_pcap helper sets ktime_ns (nanoseconds since
> + * boot) to the ktime_ns value; to get sensible pcap times this value should
> + * be converted to a struct timeval time since epoch in the struct pcap_pkthdr.
> + *
> + * When bpf_pcap() is used, a "struct bpf_pcap_hdr" is stored as we
> + * need both information about the particular packet and the protocol
> + * we are capturing.
> + */
> +
> +#define BPF_PCAP_MAGIC		0xb7fca7
> +
> +struct bpf_pcap_hdr {
> +	__u32			magic;
> +	int			protocol;
> +	__u64			flags;
> +	__u64			ktime_ns;
> +	__u32			tot_len;
> +	__u32			cap_len;
> +	__u8			data[0];
> +};
> +
> +#define BPF_PCAP_TYPE_UNSET	-1
> +#define BPF_PCAP_TYPE_ETH	1
> +#define	BPF_PCAP_TYPE_IP	12
> +
>   #endif /* _UAPI__LINUX_BPF_H__ */
[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ed65636..e0e23ee 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4158,6 +4158,35 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
>   	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>   };
>   
> +BPF_CALL_5(bpf_xdp_pcap, struct xdp_buff *, xdp, u32, size,
> +	   struct bpf_map *, map, int, protocol, u64, flags)
> +{
> +	unsigned long len = (unsigned long)(xdp->data_end - xdp->data);
> +	struct bpf_pcap_hdr pcap;
> +	int ret;
> +
> +	if (unlikely(flags & ~BPF_F_PCAP_ID_MASK))
> +		return -EINVAL;
> +
> +	ret = bpf_pcap_prepare(protocol, size, len, flags, &pcap);
> +	if (ret)
> +		return ret;
> +
> +	return bpf_event_output(map, BPF_F_CURRENT_CPU, &pcap, sizeof(pcap),
> +				xdp->data, pcap.cap_len, bpf_xdp_copy);
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_pcap_proto = {
> +	.func		= bpf_xdp_pcap,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_CONST_MAP_PTR,
> +	.arg4_type	= ARG_ANYTHING,
> +	.arg5_type	= ARG_ANYTHING,
> +};
> +
>   BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
>   {
>   	return skb->sk ? sock_gen_cookie(skb->sk) : 0;
> @@ -5926,6 +5955,34 @@ u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
>   
>   #endif /* CONFIG_INET */
>   
> +BPF_CALL_5(bpf_skb_pcap, struct sk_buff *, skb, u32, size,
> +	   struct bpf_map *, map, int, protocol, u64, flags)
> +{
> +	struct bpf_pcap_hdr pcap;
> +	int ret;
> +
> +	if (unlikely(flags & ~BPF_F_PCAP_ID_MASK))
> +		return -EINVAL;
> +
> +	ret = bpf_pcap_prepare(protocol, size, skb->len, flags, &pcap);
> +	if (ret)
> +		return ret;
> +
> +	return bpf_event_output(map, BPF_F_CURRENT_CPU, &pcap, sizeof(pcap),
> +				skb, pcap.cap_len, bpf_skb_copy);
> +}
> +
> +static const struct bpf_func_proto bpf_skb_pcap_proto = {
> +	.func		= bpf_skb_pcap,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_CONST_MAP_PTR,
> +	.arg4_type      = ARG_ANYTHING,
> +	.arg5_type	= ARG_ANYTHING,
> +};
> +
>   bool bpf_helper_changes_pkt_data(void *func)
>   {
>   	if (func == bpf_skb_vlan_push ||
> @@ -6075,6 +6132,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>   		return &bpf_get_socket_uid_proto;
>   	case BPF_FUNC_perf_event_output:
>   		return &bpf_skb_event_output_proto;
> +	case BPF_FUNC_pcap:
> +		return &bpf_skb_pcap_proto;
>   	default:
>   		return bpf_base_func_proto(func_id);
>   	}
> @@ -6216,6 +6275,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>   	case BPF_FUNC_tcp_gen_syncookie:
>   		return &bpf_tcp_gen_syncookie_proto;
>   #endif
> +	case BPF_FUNC_pcap:
> +		return &bpf_skb_pcap_proto;
>   	default:
>   		return bpf_base_func_proto(func_id);
>   	}
> @@ -6256,6 +6317,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>   		return &bpf_tcp_check_syncookie_proto;
>   	case BPF_FUNC_tcp_gen_syncookie:
>   		return &bpf_tcp_gen_syncookie_proto;
> +	case BPF_FUNC_pcap:
> +		return &bpf_xdp_pcap_proto;
>   #endif
>   	default:
>   		return bpf_base_func_proto(func_id);
> @@ -6361,6 +6424,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>   	case BPF_FUNC_skc_lookup_tcp:
>   		return &bpf_skc_lookup_tcp_proto;
>   #endif
> +	case BPF_FUNC_pcap:
> +		return &bpf_skb_pcap_proto;
>   	default:
>   		return bpf_base_func_proto(func_id);
>   	}
> @@ -6399,6 +6464,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>   		return &bpf_get_smp_processor_id_proto;
>   	case BPF_FUNC_skb_under_cgroup:
>   		return &bpf_skb_under_cgroup_proto;
> +	case BPF_FUNC_pcap:
> +		return &bpf_skb_pcap_proto;
>   	default:
>   		return bpf_base_func_proto(func_id);
>   	}
> 

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-09-08 22:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mickaël Salaün, linux-kernel, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <a870c2c9-d2f7-e0fa-c8cc-35dbf8b5b87d@ssi.gouv.fr>



On 31/07/2019 20:46, Mickaël Salaün wrote:
> 
> On 27/07/2019 03:40, Alexei Starovoitov wrote:
>> On Sun, Jul 21, 2019 at 11:31:12PM +0200, Mickaël Salaün wrote:
>>> FIXME: 64-bits in the doc
> 
> FYI, this FIXME was fixed, just not removed from this message. :)
> 
>>>
>>> This new map store arbitrary values referenced by inode keys.  The map
>>> can be updated from user space with file descriptor pointing to inodes
>>> tied to a file system.  From an eBPF (Landlock) program point of view,
>>> such a map is read-only and can only be used to retrieved a value tied
>>> to a given inode.  This is useful to recognize an inode tagged by user
>>> space, without access right to this inode (i.e. no need to have a write
>>> access to this inode).
>>>
>>> Add dedicated BPF functions to handle this type of map:
>>> * bpf_inode_htab_map_update_elem()
>>> * bpf_inode_htab_map_lookup_elem()
>>> * bpf_inode_htab_map_delete_elem()
>>>
>>> This new map require a dedicated helper inode_map_lookup_elem() because
>>> of the key which is a pointer to an opaque data (only provided by the
>>> kernel).  This act like a (physical or cryptographic) key, which is why
>>> it is also not allowed to get the next key.
>>>
>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>
>> there are too many things to comment on.
>> Let's do this patch.
>>
>> imo inode_map concept is interesting, but see below...
>>
>>> +
>>> +    /*
>>> +     * Limit number of entries in an inode map to the maximum number of
>>> +     * open files for the current process. The maximum number of file
>>> +     * references (including all inode maps) for a process is then
>>> +     * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
>>> +     * is 0, then any entry update is forbidden.
>>> +     *
>>> +     * An eBPF program can inherit all the inode map FD. The worse case is
>>> +     * to fill a bunch of arraymaps, create an eBPF program, close the
>>> +     * inode map FDs, and start again. The maximum number of inode map
>>> +     * entries can then be close to RLIMIT_NOFILE^3.
>>> +     */
>>> +    if (attr->max_entries > rlimit(RLIMIT_NOFILE))
>>> +            return -EMFILE;
>>
>> rlimit is checked, but no fd are consumed.
>> Once created such inode map_fd can be passed to a different process.
>> map_fd can be pinned into bpffs.
>> etc.
>> what the value of the check?
> 
> I was looking for the most meaningful limit for a process, and rlimit is
> the best I found. As the limit of open FD per processes, rlimit is not
> perfect, but I think the semantic is close here (e.g. a process can also
> pass FD through unix socket).
> 
>>
>>> +
>>> +    /* decorelate UAPI from kernel API */
>>> +    attr->key_size = sizeof(struct inode *);
>>> +
>>> +    return htab_map_alloc_check(attr);
>>> +}
>>> +
>>> +static void inode_htab_put_key(void *key)
>>> +{
>>> +    struct inode **inode = key;
>>> +
>>> +    if ((*inode)->i_state & I_FREEING)
>>> +            return;
>>
>> checking the state without take a lock? isn't it racy?
> 
> This should only trigger when called from security_inode_free(). I'll
> add a comment.
> 
>>
>>> +    iput(*inode);
>>> +}
>>> +
>>> +/* called from syscall or (never) from eBPF program */
>>> +static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
>>> +{
>>> +    /* do not leak a file descriptor */
>>
>> what this comment suppose to mean?
> 
> Because a key is a reference to an inode, a possible return value for
> this function could be a file descriptor pointing to this inode (the
> same way a file descriptor is use to add an element). For now, I don't
> want to implement a way for a process with such a map to extract such
> inode, which I compare to a possible leak (of information, not kernel
> memory nor object). This could be implemented in the future if there is
> value in it (and probably some additional safeguards), though.
> 
>>
>>> +    return -ENOTSUPP;
>>> +}
>>> +
>>> +/* must call iput(inode) after this call */
>>> +static struct inode *inode_from_fd(int ufd, bool check_access)
>>> +{
>>> +    struct inode *ret;
>>> +    struct fd f;
>>> +    int deny;
>>> +
>>> +    f = fdget(ufd);
>>> +    if (unlikely(!f.file))
>>> +            return ERR_PTR(-EBADF);
>>> +    /* TODO?: add this check when called from an eBPF program too (already
>>> +    * checked by the LSM parent hooks anyway) */
>>> +    if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
>>> +            ret = ERR_PTR(-EINVAL);
>>> +            goto put_fd;
>>> +    }
>>> +    /* check if the FD is tied to a mount point */
>>> +    /* TODO?: add this check when called from an eBPF program too */
>>> +    if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
>>> +            ret = ERR_PTR(-EINVAL);
>>> +            goto put_fd;
>>> +    }
>>
>> a bunch of TODOs do not inspire confidence.
> 
> I think the current implement is good, but these TODOs are here to draw
> attention on particular points for which I would like external review
> and opinion (hence the "?").
> 
>>
>>> +    if (check_access) {
>>> +            /*
>>> +            * must be allowed to access attributes from this file to then
>>> +            * be able to compare an inode to its map entry
>>> +            */
>>> +            deny = security_inode_getattr(&f.file->f_path);
>>> +            if (deny) {
>>> +                    ret = ERR_PTR(deny);
>>> +                    goto put_fd;
>>> +            }
>>> +    }
>>> +    ret = file_inode(f.file);
>>> +    ihold(ret);
>>> +
>>> +put_fd:
>>> +    fdput(f);
>>> +    return ret;
>>> +}
>>> +
>>> +/*
>>> + * The key is a FD when called from a syscall, but an inode address when called
>>> + * from an eBPF program.
>>> + */
>>> +
>>> +/* called from syscall */
>>> +int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
>>> +{
>>> +    void *ptr;
>>> +    struct inode *inode;
>>> +    int ret;
>>> +
>>> +    /* check inode access */
>>> +    inode = inode_from_fd(*key, true);
>>> +    if (IS_ERR(inode))
>>> +            return PTR_ERR(inode);
>>> +
>>> +    rcu_read_lock();
>>> +    ptr = htab_map_lookup_elem(map, &inode);
>>> +    iput(inode);
>>> +    if (IS_ERR(ptr)) {
>>> +            ret = PTR_ERR(ptr);
>>> +    } else if (!ptr) {
>>> +            ret = -ENOENT;
>>> +    } else {
>>> +            ret = 0;
>>> +            copy_map_value(map, value, ptr);
>>> +    }
>>> +    rcu_read_unlock();
>>> +    return ret;
>>> +}
>>> +
>>> +/* called from kernel */
>>
>> wrong comment?
>> kernel side cannot call it, right?
> 
> This is called from bpf_inode_fd_htab_map_delete_elem() (code just
> beneath), and from
> kernel/bpf/syscall.c:bpf_inode_ptr_unlocked_htab_map_delet_elem() which
> can be called by security_inode_free() (hook_inode_free_security).
> 
>>
>>> +int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
>>> +            struct inode **key, bool remove_in_inode)
>>> +{
>>> +    if (remove_in_inode)
>>> +            landlock_inode_remove_map(*key, map);
>>> +    return htab_map_delete_elem(map, key);
>>> +}
>>> +
>>> +/* called from syscall */
>>> +int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
>>> +{
>>> +    struct inode *inode;
>>> +    int ret;
>>> +
>>> +    /* do not check inode access (similar to directory check) */
>>> +    inode = inode_from_fd(*key, false);
>>> +    if (IS_ERR(inode))
>>> +            return PTR_ERR(inode);
>>> +    ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
>>> +    iput(inode);
>>> +    return ret;
>>> +}
>>> +
>>> +/* called from syscall */
>>> +int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
>>> +            u64 map_flags)
>>> +{
>>> +    struct inode *inode;
>>> +    int ret;
>>> +
>>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>>> +
>>> +    /* check inode access */
>>> +    inode = inode_from_fd(*key, true);
>>> +    if (IS_ERR(inode))
>>> +            return PTR_ERR(inode);
>>> +    ret = htab_map_update_elem(map, &inode, value, map_flags);
>>> +    if (!ret)
>>> +            ret = landlock_inode_add_map(inode, map);
>>> +    iput(inode);
>>> +    return ret;
>>> +}
>>> +
>>> +static void inode_htab_map_free(struct bpf_map *map)
>>> +{
>>> +    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>>> +    struct hlist_nulls_node *n;
>>> +    struct hlist_nulls_head *head;
>>> +    struct htab_elem *l;
>>> +    int i;
>>> +
>>> +    for (i = 0; i < htab->n_buckets; i++) {
>>> +            head = select_bucket(htab, i);
>>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>>> +            }
>>> +    }
>>> +    htab_map_free(map);
>>> +}
>>
>> user space can delete the map.
>> that will trigger inode_htab_map_free() which will call
>> landlock_inode_remove_map().
>> which will simply itereate the list and delete from the list.
> 
> landlock_inode_remove_map() removes the reference to the map (being
> freed) from the inode (with an RCU lock).
> 
>>
>> While in parallel inode can be destoyed and hook_inode_free_security()
>> will be called.
>> I think nothing that protects from this race.
> 
> According to security_inode_free(), the inode is effectively freed after
> the RCU grace period. However, I forgot to call bpf_map_inc() in
> landlock_inode_add_map(), which would prevent the map to be freed
> outside of the security_inode_free(). I'll fix that.
> 
>>
>>> +
>>> +/*
>>> + * We need a dedicated helper to deal with inode maps because the key is a
>>> + * pointer to an opaque data, only provided by the kernel.  This really act
>>> + * like a (physical or cryptographic) key, which is why it is also not allowed
>>> + * to get the next key with map_get_next_key().
>>
>> inode pointer is like cryptographic key? :)
> 
> I wanted to highlight the fact that, contrary to other map key types,
> the value of this one should not be readable, only usable. A "secret
> value" is more appropriate but still confusing. I'll rephrase that.
> 
>>
>>> + */
>>> +BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
>>> +{
>>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>>> +    return (unsigned long)htab_map_lookup_elem(map, &key);
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
>>> +    .func           = bpf_inode_map_lookup_elem,
>>> +    .gpl_only       = false,
>>> +    .pkt_access     = true,
>>
>> pkt_access ? :)
> 
> This slipped in with this rebase, I'll remove it. :)
> 
>>
>>> +    .ret_type       = RET_PTR_TO_MAP_VALUE_OR_NULL,
>>> +    .arg1_type      = ARG_CONST_MAP_PTR,
>>> +    .arg2_type      = ARG_PTR_TO_INODE,
>>> +};
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index b2a8cb14f28e..e46441c42b68 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
>>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>>              err = map->ops->map_peek_elem(map, value);
>>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>>> +            err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
>>>      } else {
>>>              rcu_read_lock();
>>>              if (map->ops->map_lookup_elem_sys_only)
>>> @@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
>>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>>              err = map->ops->map_push_elem(map, value, attr->flags);
>>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>>> +            rcu_read_lock();
>>> +            err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
>>> +            rcu_read_unlock();
>>>      } else {
>>>              rcu_read_lock();
>>>              err = map->ops->map_update_elem(map, key, value, attr->flags);
>>> @@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
>>>      preempt_disable();
>>>      __this_cpu_inc(bpf_prog_active);
>>>      rcu_read_lock();
>>> -    err = map->ops->map_delete_elem(map, key);
>>> +    if (map->map_type == BPF_MAP_TYPE_INODE)
>>> +            err = bpf_inode_fd_htab_map_delete_elem(map, key);
>>> +    else
>>> +            err = map->ops->map_delete_elem(map, key);
>>>      rcu_read_unlock();
>>>      __this_cpu_dec(bpf_prog_active);
>>>      preempt_enable();
>>> @@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
>>>      return err;
>>>  }
>>>
>>> +int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
>>> +                                            struct inode **key, bool remove_in_inode)
>>> +{
>>> +    int err;
>>> +
>>> +    preempt_disable();
>>> +    __this_cpu_inc(bpf_prog_active);
>>> +    rcu_read_lock();
>>> +    err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
>>> +    rcu_read_unlock();
>>> +    __this_cpu_dec(bpf_prog_active);
>>> +    preempt_enable();
>>> +    maybe_wait_bpf_programs(map);
>>
>> if that function was actually doing synchronize_rcu() the consequences
>> would have been unpleasant. Fortunately it's a nop in this case.
>> Please read the code carefully before copy-paste.
>> Also what do you think the reason of bpf_prog_active above?
>> What is the reason of rcu_read_lock above?
> 
> The RCU is used as for every map modifications (usually from userspace).
> I wasn't sure about the other protections so I kept the same (generic)
> checks as in map_delete_elem() (just above) because this function follow
> the same semantic. What can I safely remove?
> 
>>
>> I think the patch set needs to shrink at least in half to be reviewable.
>> The way you tie seccomp and lsm is probably the biggest obstacle
>> than any of the bugs above.
>> Can you drop seccomp ? and do it as normal lsm ?
> 
> The seccomp/enforcement part is needed to have a minimum viable product,
> i.e. a process able to sandbox itself. Are you suggesting to first merge
> a version when it is only possible to create inode maps but not use them
> in an useful way (i.e. for sandboxing)? I can do it if it's OK with you,
> and I hope it will not be a problem for the security folks if it can
> help to move forward.

I talked with Kees Cook and James Morris at LSS NA, and I think the
better strategy to shrink this patch series is to tackle a much less
complex problem at first. Instead on focusing right now on file system,
the next version of this patch series will focus on memory protection,
which is also something desired. I'll then iterate with file system
support (i.e. inode maps) and other use cases once the basics of
Landlock are upstream. For this next series, the majority of the code
will be on the LSM side, while the eBPF part will mainly consist to add
a new program type. Because bpf-next is moving rapidly, I think it still
make sense to base this work on this tree (instead of linux-security).

Regards,
 Mickaël

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Al Viro @ 2019-09-08 22:19 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Alexei Starovoitov, Mickaël Salaün, linux-kernel,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <894922a2-1150-366c-3f08-c8b759da0742@digikod.net>

On Mon, Sep 09, 2019 at 12:09:57AM +0200, Mickaël Salaün wrote:

> >>> +    rcu_read_lock();
> >>> +    ptr = htab_map_lookup_elem(map, &inode);
> >>> +    iput(inode);

Wait a sec.  You are doing _what_ under rcu_read_lock()?

> >>> +    if (IS_ERR(ptr)) {
> >>> +            ret = PTR_ERR(ptr);
> >>> +    } else if (!ptr) {
> >>> +            ret = -ENOENT;
> >>> +    } else {
> >>> +            ret = 0;
> >>> +            copy_map_value(map, value, ptr);
> >>> +    }
> >>> +    rcu_read_unlock();

^ permalink raw reply

* Re: [RFC bpf-next 2/7] bpf: extend bpf_pcap support to tracing programs
From: Yonghong Song @ 2019-09-08 22:18 UTC (permalink / raw)
  To: Alan Maguire, ast@kernel.org, daniel@iogearbox.net, Martin Lau,
	Song Liu, davem@davemloft.net, jakub.kicinski@netronome.com,
	hawk@kernel.org, john.fastabend@gmail.com, rostedt@goodmis.org,
	mingo@redhat.com, quentin.monnet@netronome.com, Andrey Ignatov,
	joe@wand.net.nz, acme@redhat.com, jolsa@kernel.org,
	alexey.budankov@linux.intel.com, gregkh@linuxfoundation.org,
	namhyung@kernel.org, sdf@google.com, f.fainelli@gmail.com,
	shuah@kernel.org, peter@lekensteyn.nl, ivan@cloudflare.com,
	Andrii Nakryiko, bhole_prashant_q7@lab.ntt.co.jp,
	david.calavera@gmail.com, danieltimlee@gmail.com,
	Takshak Chahande, netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kselftest@vger.kernel.org
In-Reply-To: <1567892444-16344-3-git-send-email-alan.maguire@oracle.com>



On 9/7/19 2:40 PM, Alan Maguire wrote:
> packet capture is especially valuable in tracing contexts, so
> extend bpf_pcap helper to take a tracing-derived skb pointer
> as an argument.
> 
> In the case of tracing programs, the starting protocol
> (corresponding to libpcap DLT_* values; 1 for Ethernet, 12 for
> IP, etc) needs to be specified and should reflect the protocol
> type which is pointed to by the skb->data pointer; i.e. the
> start of the packet.  This can derived in a limited set of cases,
> but should be specified where possible.  For skb and xdp programs
> this protocol will nearly always be 1 (BPF_PCAP_TYPE_ETH).
> 
> Example usage for a tracing program, where we use a
> struct bpf_pcap_hdr array map to pass in preferences for
> protocol and max len:
> 
> struct bpf_map_def SEC("maps") pcap_conf_map = {
> 	.type = BPF_MAP_TYPE_ARRAY,
> 	.key_size = sizeof(int),
> 	.value_size = sizeof(struct bpf_pcap_hdr),
> 	.max_entries = 1,
> };
> 
> struct bpf_map_def SEC("maps") pcap_map = {
> 	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> 	.key_size = sizeof(int),
> 	.value_size = sizeof(int),
> 	.max_entries = 1024,
> };
> 
> SEC("kprobe/kfree_skb")
> int probe_kfree_skb(struct pt_regs *ctx)
> {
> 	struct bpf_pcap_hdr *conf;
> 	int key = 0;
> 
> 	conf = bpf_map_lookup_elem(&pcap_conf_map, &key);
> 	if (!conf)
> 		return 0;
> 
> 	bpf_pcap((void *)PT_REGS_PARM1(ctx), conf->cap_len, &pcap_map,
> 		 conf->protocol, BPF_F_CURRENT_CPU);
> 	return 0;
> }
> 
> Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
> ---
[...]
> @@ -2977,6 +2992,8 @@ enum bpf_func_id {
>   
>   /* BPF_FUNC_pcap flags */
>   #define	BPF_F_PCAP_ID_MASK		0xffffffffffff
> +#define BPF_F_PCAP_ID_IIFINDEX		(1ULL << 48)
> +#define BPF_F_PCAP_STRICT_TYPE         (1ULL << 56)
>   
>   /* Mode for BPF_FUNC_skb_adjust_room helper. */
>   enum bpf_adj_room_mode {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d..311883b 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -13,6 +13,8 @@
>   #include <linux/kprobes.h>
>   #include <linux/syscalls.h>
>   #include <linux/error-injection.h>
> +#include <linux/skbuff.h>
> +#include <linux/ip.h>
>   
>   #include <asm/tlb.h>
>   
> @@ -530,6 +532,216 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>   	return __bpf_perf_event_output(regs, map, flags, sd);
>   }
>   
> +/* Essentially just skb_copy_bits() using probe_kernel_read() where needed. */
> +static unsigned long bpf_trace_skb_copy(void *tobuf, const void *from,
> +					unsigned long offset,
> +					unsigned long len)
> +{
> +	const struct sk_buff *frag_iterp, *skb = from;
> +	struct skb_shared_info *shinfop, shinfo;
> +	struct sk_buff frag_iter;
> +	unsigned long copy, start;
> +	void *to = tobuf;
> +	int i, ret;
> +
> +	start = skb_headlen(skb);
> +
> +	copy = start - offset;
> +	if (copy > 0) {
> +		if (copy > len)
> +			copy = len;
> +		ret = probe_kernel_read(to, skb->data, copy);
> +		if (unlikely(ret < 0))
> +			goto out;
> +		len -= copy;
> +		if (len == 0)
> +			return 0;
> +		offset += copy;
> +		to += copy;
> +	}
> +
> +	if (skb->data_len == 0)
> +		goto out;
> +
> +	shinfop = skb_shinfo(skb);
> +
> +	ret = probe_kernel_read(&shinfo, shinfop, sizeof(shinfo));
> +	if (unlikely(ret < 0))
> +		goto out;
> +
> +	if (shinfo.nr_frags > MAX_SKB_FRAGS) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	for (i = 0; i < shinfo.nr_frags; i++) {
> +		skb_frag_t *f = &shinfo.frags[i];
> +		int end;
> +
> +		if (start > offset + len) {
> +			ret = -E2BIG;
> +			goto out;
> +		}
> +
> +		end = start + skb_frag_size(f);
> +		copy = end - offset;
> +		if (copy > 0) {
> +			u32 poff, p_len, copied;
> +			struct page *p;
> +			u8 *vaddr;
> +
> +			if (copy > len)
> +				copy = len;
> +
> +			skb_frag_foreach_page(f,
> +					      skb_frag_off(f) + offset - start,
> +					      copy, p, poff, p_len, copied) {
> +
> +				vaddr = kmap_atomic(p);
> +				ret = probe_kernel_read(to + copied,
> +							vaddr + poff, p_len);
> +				kunmap_atomic(vaddr);
> +
> +				if (unlikely(ret < 0))
> +					goto out;
> +			}
> +			len -= copy;
> +			if (len == 0)
> +				return 0;
> +			offset += copy;
> +			to += copy;
> +		}
> +		start = end;
> +	}
> +
> +	for (frag_iterp = shinfo.frag_list; frag_iterp;
> +	     frag_iterp = frag_iter.next) {
> +		int end;
> +
> +		if (start > offset + len) {
> +			ret = -E2BIG;
> +			goto out;
> +		}
> +		ret = probe_kernel_read(&frag_iter, frag_iterp,
> +					sizeof(frag_iter));
> +		if (ret)
> +			goto out;
> +
> +		end = start + frag_iter.len;
> +		copy = end - offset;
> +		if (copy > 0) {
> +			if (copy > len)
> +				copy = len;
> +			ret = bpf_trace_skb_copy(to, &frag_iter,
> +						offset - start,
> +						copy);
> +			if (ret)
> +				goto out;
> +
> +			len -= copy;
> +			if (len == 0)
> +				return 0;
> +			offset += copy;
> +			to += copy;
> +		}
> +		start = end;
> +	}
> +out:
> +	if (ret)
> +		memset(tobuf, 0, len);
> +
> +	return ret;
> +}

For net side bpf_perf_event_output, we have
static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
                                   unsigned long off, unsigned long len)
{
         void *ptr = skb_header_pointer(skb, off, len, dst_buff);

         if (unlikely(!ptr))
                 return len;
         if (ptr != dst_buff)
                 memcpy(dst_buff, ptr, len);

         return 0;
}

BPF_CALL_5(bpf_skb_event_output, struct sk_buff *, skb, struct bpf_map 
*, map,
            u64, flags, void *, meta, u64, meta_size)
{
         u64 skb_size = (flags & BPF_F_CTXLEN_MASK) >> 32;

         if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK)))
                 return -EINVAL;
         if (unlikely(skb_size > skb->len))
                 return -EFAULT;

         return bpf_event_output(map, flags, meta, meta_size, skb, skb_size,
                                 bpf_skb_copy);
}

It does not really consider output all the frags.
I understand that to get truly all packet data, frags should be
considered, but seems we did not do it before? I am wondering
whether we need to do here.

If we indeed do not need to handle frags here, I think maybe
bpf_probe_read() in existing bpf kprobe function should be
enough, we do not need this helper?

> +
> +/* Derive protocol for some of the easier cases.  For tracing, a probe point
> + * may be dealing with packets in various states. Common cases are IP
> + * packets prior to adding MAC header (_PCAP_TYPE_IP) and a full packet
> + * (_PCAP_TYPE_ETH).  For other cases the caller must specify the
> + * protocol they expect.  Other heuristics for packet identification
> + * should be added here as needed, since determining the packet type
> + * ensures we do not capture packets that fail to match the desired
> + * pcap type in BPF_F_PCAP_STRICT_TYPE mode.
> + */
> +static inline int bpf_skb_protocol_get(struct sk_buff *skb)
> +{
> +	switch (htons(skb->protocol)) {
> +	case ETH_P_IP:
> +	case ETH_P_IPV6:
> +		if (skb_network_header(skb) == skb->data)
> +			return BPF_PCAP_TYPE_IP;
> +		else
> +			return BPF_PCAP_TYPE_ETH;
> +	default:
> +		return BPF_PCAP_TYPE_UNSET;
> +	}
> +}
> +
> +BPF_CALL_5(bpf_trace_pcap, void *, data, u32, size, struct bpf_map *, map,
> +	   int, protocol_wanted, u64, flags)

Up to now, for helpers, verifier has a way to verifier it is used 
properly regarding to the context. For example, for xdp version
perf_event_output, the help prototype,
   BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct 
bpf_map *, map,
            u64, flags, void *, meta, u64, meta_size)
the verifier is able to guarantee that the first parameter
has correct type xdp_buff, not something from type cast.
   .arg1_type      = ARG_PTR_TO_CTX,

This helper, in the below we have
   .arg1_type	= ARG_ANYTHING,

So it is not really enforced. Bringing BTF can help, but type
name matching typically bad.


> +{
> +	struct bpf_pcap_hdr pcap;
> +	struct sk_buff skb;
> +	int protocol;
> +	int ret;
> +
> +	if (unlikely(flags & ~(BPF_F_PCAP_ID_IIFINDEX | BPF_F_PCAP_ID_MASK |
> +			       BPF_F_PCAP_STRICT_TYPE)))
> +		return -EINVAL;
> +
> +	ret = probe_kernel_read(&skb, data, sizeof(skb));
> +	if (unlikely(ret < 0))
> +		return ret;
> +
> +	/* Sanity check skb len in case we get bogus data. */
> +	if (unlikely(!skb.len))
> +		return -ENOENT;
> +	if (unlikely(skb.len > GSO_MAX_SIZE || skb.data_len > skb.len))
> +		return -E2BIG;
> +
> +	protocol = bpf_skb_protocol_get(&skb);
> +
> +	if (protocol_wanted == BPF_PCAP_TYPE_UNSET) {
> +		/* If we cannot determine protocol type, bail. */
> +		if (protocol == BPF_PCAP_TYPE_UNSET)
> +			return -EPROTO;
> +	} else {
> +		/* if we determine protocol type, and it's not what we asked
> +		 * for _and_ we are in strict mode, bail.  Otherwise we assume
> +		 * the packet is the requested protocol type and drive on.
> +		 */
> +		if (flags & BPF_F_PCAP_STRICT_TYPE &&
> +		    protocol != BPF_PCAP_TYPE_UNSET &&
> +		    protocol != protocol_wanted)
> +			return -EPROTO;
> +		protocol = protocol_wanted;
> +	}
> +
> +	/* If we specified a matching incoming ifindex, bail if not a match. */
> +	if (flags & BPF_F_PCAP_ID_IIFINDEX) {
> +		int iif = flags & BPF_F_PCAP_ID_MASK;
> +
> +		if (iif && skb.skb_iif != iif)
> +			return -ENOENT;
> +	}
> +
> +	ret = bpf_pcap_prepare(protocol, size, skb.len, flags, &pcap);
> +	if (ret)
> +		return ret;
> +
> +	return bpf_event_output(map, BPF_F_CURRENT_CPU, &pcap, sizeof(pcap),
> +				&skb, pcap.cap_len, bpf_trace_skb_copy);
> +}
> +
> +static const struct bpf_func_proto bpf_trace_pcap_proto = {
> +	.func		= bpf_trace_pcap,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_ANYTHING,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_CONST_MAP_PTR,
> +	.arg4_type	= ARG_ANYTHING,
> +	.arg5_type	= ARG_ANYTHING,
> +};
> +
>   BPF_CALL_0(bpf_get_current_task)
>   {
>   	return (long) current;
> @@ -709,6 +921,8 @@ static void do_bpf_send_signal(struct irq_work *entry)
>   #endif
>   	case BPF_FUNC_send_signal:
>   		return &bpf_send_signal_proto;
> +	case BPF_FUNC_pcap:
> +		return &bpf_trace_pcap_proto;
>   	default:
>   		return NULL;
>   	}
> 

^ permalink raw reply

* [PATCH net-next v2 00/11] nfp: implement firmware loading policy
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman

Dirk says:

This series adds configuration capabilities to the firmware loading policy of
the NFP driver.

NFP firmware loading is controlled via three HWinfo keys which can be set per
device: 'abi_drv_reset', 'abi_drv_load_ifc' and 'app_fw_from_flash'.
Refer to patch #11 for more detail on how these control the firmware loading.

In order to configure the full extend of FW loading policy, a new devlink
parameter has been introduced, 'reset_dev_on_drv_probe', which controls if the
driver should reset the device when it's probed. This, in conjunction with the
existing 'fw_load_policy' (extended to include a 'disk' option) provides the
means to tweak the NFP HWinfo keys as required by users.

Patches 1 and 2 adds the devlink modifications and patches 3 through 9 adds the
support into the NFP driver. Furthermore, the last 2 patches are documentation
only.

v2:
  Renamed all 'reset_dev_on_drv_probe' defines the same as the devlink parameter
  name (Jiri)

Dirk van der Merwe (11):
  devlink: extend 'fw_load_policy' values
  devlink: add 'reset_dev_on_drv_probe' param
  nfp: nsp: add support for fw_loaded command
  nfp: nsp: add support for optional hwinfo lookup
  nfp: nsp: add support for hwinfo set operation
  nfp: honor FW reset and loading policies
  nfp: add devlink param infrastructure
  nfp: devlink: add 'fw_load_policy' support
  nfp: devlink: add 'reset_dev_on_drv_probe' support
  kdoc: fix nfp_fw_load documentation
  Documentation: nfp: add nfp driver specific notes

 .../networking/device_drivers/netronome/nfp.rst    | 133 +++++++++++
 Documentation/networking/devlink-params-nfp.txt    |   5 +
 Documentation/networking/devlink-params.txt        |  16 ++
 drivers/net/ethernet/netronome/nfp/Makefile        |   1 +
 drivers/net/ethernet/netronome/nfp/devlink_param.c | 254 +++++++++++++++++++++
 drivers/net/ethernet/netronome/nfp/nfp_main.c      | 142 +++++++++---
 drivers/net/ethernet/netronome/nfp/nfp_main.h      |   5 +
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  |   7 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c   |  77 ++++++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h   |  29 +++
 include/net/devlink.h                              |   5 +
 include/uapi/linux/devlink.h                       |   8 +
 net/core/devlink.c                                 |   5 +
 13 files changed, 657 insertions(+), 30 deletions(-)
 create mode 100644 Documentation/networking/device_drivers/netronome/nfp.rst
 create mode 100644 Documentation/networking/devlink-params-nfp.txt
 create mode 100644 drivers/net/ethernet/netronome/nfp/devlink_param.c

-- 
2.11.0


^ permalink raw reply

* [PATCH net-next v2 01/11] devlink: extend 'fw_load_policy' values
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add the 'disk' value to the generic 'fw_load_policy' devlink parameter.
This value indicates that firmware should always be loaded from disk
only.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params.txt | 2 ++
 include/uapi/linux/devlink.h                | 1 +
 2 files changed, 3 insertions(+)

diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt
index 2d26434ddcf8..fadb5436188d 100644
--- a/Documentation/networking/devlink-params.txt
+++ b/Documentation/networking/devlink-params.txt
@@ -48,4 +48,6 @@ fw_load_policy		[DEVICE, GENERIC]
 			  Load firmware version preferred by the driver.
 			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH (1)
 			  Load firmware currently stored in flash.
+			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK (2)
+			  Load firmware currently available on host's disk.
 			Type: u8
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 546e75dd74ac..c25cc29a6647 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -202,6 +202,7 @@ enum devlink_param_cmode {
 enum devlink_param_fw_load_policy_value {
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER,
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH,
+	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
 };
 
 enum {
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 02/11] devlink: add 'reset_dev_on_drv_probe' param
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add the 'reset_dev_on_drv_probe' devlink parameter, controlling the
device reset policy on driver probe.

This parameter is useful in conjunction with the existing
'fw_load_policy' parameter.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params.txt | 14 ++++++++++++++
 include/net/devlink.h                       |  5 +++++
 include/uapi/linux/devlink.h                |  7 +++++++
 net/core/devlink.c                          |  5 +++++
 4 files changed, 31 insertions(+)

diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt
index fadb5436188d..ddba3e9b55b1 100644
--- a/Documentation/networking/devlink-params.txt
+++ b/Documentation/networking/devlink-params.txt
@@ -51,3 +51,17 @@ fw_load_policy		[DEVICE, GENERIC]
 			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK (2)
 			  Load firmware currently available on host's disk.
 			Type: u8
+
+reset_dev_on_drv_probe	[DEVICE, GENERIC]
+			Controls the device's reset policy on driver probe.
+			Valid values:
+			* DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN (0)
+			  Unknown or invalid value.
+			* DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS (1)
+			  Always reset device on driver probe.
+			* DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER (2)
+			  Never reset device on driver probe.
+			* DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK (3)
+			  Reset only if device firmware can be found in the
+			  filesystem.
+			Type: u8
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 460bc629d1a4..03e4d9244ff3 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -398,6 +398,7 @@ enum devlink_param_generic_id {
 	DEVLINK_PARAM_GENERIC_ID_MSIX_VEC_PER_PF_MAX,
 	DEVLINK_PARAM_GENERIC_ID_MSIX_VEC_PER_PF_MIN,
 	DEVLINK_PARAM_GENERIC_ID_FW_LOAD_POLICY,
+	DEVLINK_PARAM_GENERIC_ID_RESET_DEV_ON_DRV_PROBE,
 
 	/* add new param generic ids above here*/
 	__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -428,6 +429,10 @@ enum devlink_param_generic_id {
 #define DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_NAME "fw_load_policy"
 #define DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_TYPE DEVLINK_PARAM_TYPE_U8
 
+#define DEVLINK_PARAM_GENERIC_RESET_DEV_ON_DRV_PROBE_NAME \
+	"reset_dev_on_drv_probe"
+#define DEVLINK_PARAM_GENERIC_RESET_DEV_ON_DRV_PROBE_TYPE DEVLINK_PARAM_TYPE_U8
+
 #define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate)	\
 {									\
 	.id = DEVLINK_PARAM_GENERIC_ID_##_id,				\
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index c25cc29a6647..1da3e83f1fd4 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -205,6 +205,13 @@ enum devlink_param_fw_load_policy_value {
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
 };
 
+enum devlink_param_reset_dev_on_drv_probe_value {
+	DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN,
+	DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS,
+	DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER,
+	DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK,
+};
+
 enum {
 	DEVLINK_ATTR_STATS_RX_PACKETS,		/* u64 */
 	DEVLINK_ATTR_STATS_RX_BYTES,		/* u64 */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 6e52d639dac6..4a2fb94c44cf 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -2852,6 +2852,11 @@ static const struct devlink_param devlink_param_generic[] = {
 		.name = DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_NAME,
 		.type = DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_TYPE,
 	},
+	{
+		.id = DEVLINK_PARAM_GENERIC_ID_RESET_DEV_ON_DRV_PROBE,
+		.name = DEVLINK_PARAM_GENERIC_RESET_DEV_ON_DRV_PROBE_NAME,
+		.type = DEVLINK_PARAM_GENERIC_RESET_DEV_ON_DRV_PROBE_TYPE,
+	},
 };
 
 static int devlink_param_generic_verify(const struct devlink_param *param)
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 03/11] nfp: nsp: add support for fw_loaded command
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add support for the simple command that indicates whether application
firmware is loaded.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 10 ++++++++++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index 9a08623c325d..b4c5dc5f7404 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -96,6 +96,7 @@ enum nfp_nsp_cmd {
 	SPCODE_NSP_IDENTIFY	= 13, /* Read NSP version */
 	SPCODE_FW_STORED	= 16, /* If no FW loaded, load flash app FW */
 	SPCODE_HWINFO_LOOKUP	= 17, /* Lookup HWinfo with overwrites etc. */
+	SPCODE_FW_LOADED	= 19, /* Is application firmware loaded */
 	SPCODE_VERSIONS		= 21, /* Report FW versions */
 	SPCODE_READ_SFF_EEPROM	= 22, /* Read module EEPROM */
 };
@@ -925,6 +926,15 @@ int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size)
 	return 0;
 }
 
+int nfp_nsp_fw_loaded(struct nfp_nsp *state)
+{
+	const struct nfp_nsp_command_arg arg = {
+		.code		= SPCODE_FW_LOADED,
+	};
+
+	return __nfp_nsp_command(state, &arg);
+}
+
 int nfp_nsp_versions(struct nfp_nsp *state, void *buf, unsigned int size)
 {
 	struct nfp_nsp_command_buf_arg versions = {
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
index 22ee6985ee1c..4ceecff347c6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
@@ -22,6 +22,7 @@ int nfp_nsp_write_flash(struct nfp_nsp *state, const struct firmware *fw);
 int nfp_nsp_mac_reinit(struct nfp_nsp *state);
 int nfp_nsp_load_stored_fw(struct nfp_nsp *state);
 int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size);
+int nfp_nsp_fw_loaded(struct nfp_nsp *state);
 int nfp_nsp_read_module_eeprom(struct nfp_nsp *state, int eth_index,
 			       unsigned int offset, void *data,
 			       unsigned int len, unsigned int *read_len);
@@ -41,6 +42,11 @@ static inline bool nfp_nsp_has_hwinfo_lookup(struct nfp_nsp *state)
 	return nfp_nsp_get_abi_ver_minor(state) > 24;
 }
 
+static inline bool nfp_nsp_has_fw_loaded(struct nfp_nsp *state)
+{
+	return nfp_nsp_get_abi_ver_minor(state) > 25;
+}
+
 static inline bool nfp_nsp_has_versions(struct nfp_nsp *state)
 {
 	return nfp_nsp_get_abi_ver_minor(state) > 27;
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 04/11] nfp: nsp: add support for optional hwinfo lookup
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

There are cases where we want to read a hwinfo entry from the NFP, and
if it doesn't exist, use a default value instead.

To support this, we must silence warning/error messages when the hwinfo
entry doesn't exist since this is a valid use case. The NSP command
structure provides the ability to silence command errors, in which case
the caller should log any command errors appropriately. Protocol errors
are unaffected by this.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c   | 52 ++++++++++++++++++++--
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h   |  2 +
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index b4c5dc5f7404..deee91cbf1b2 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -144,6 +144,8 @@ struct nfp_nsp {
  * @option:	NFP SP Command Argument
  * @buf:	NFP SP Buffer Address
  * @error_cb:	Callback for interpreting option if error occurred
+ * @error_quiet:Don't print command error/warning. Protocol errors are still
+ *		    logged.
  */
 struct nfp_nsp_command_arg {
 	u16 code;
@@ -152,6 +154,7 @@ struct nfp_nsp_command_arg {
 	u32 option;
 	u64 buf;
 	void (*error_cb)(struct nfp_nsp *state, u32 ret_val);
+	bool error_quiet;
 };
 
 /**
@@ -406,8 +409,10 @@ __nfp_nsp_command(struct nfp_nsp *state, const struct nfp_nsp_command_arg *arg)
 
 	err = FIELD_GET(NSP_STATUS_RESULT, reg);
 	if (err) {
-		nfp_warn(cpp, "Result (error) code set: %d (%d) command: %d\n",
-			 -err, (int)ret_val, arg->code);
+		if (!arg->error_quiet)
+			nfp_warn(cpp, "Result (error) code set: %d (%d) command: %d\n",
+				 -err, (int)ret_val, arg->code);
+
 		if (arg->error_cb)
 			arg->error_cb(state, ret_val);
 		else
@@ -892,12 +897,14 @@ int nfp_nsp_load_stored_fw(struct nfp_nsp *state)
 }
 
 static int
-__nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size)
+__nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size,
+			bool optional)
 {
 	struct nfp_nsp_command_buf_arg hwinfo_lookup = {
 		{
 			.code		= SPCODE_HWINFO_LOOKUP,
 			.option		= size,
+			.error_quiet	= optional,
 		},
 		.in_buf		= buf,
 		.in_size	= size,
@@ -914,7 +921,7 @@ int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size)
 
 	size = min_t(u32, size, NFP_HWINFO_LOOKUP_SIZE);
 
-	err = __nfp_nsp_hwinfo_lookup(state, buf, size);
+	err = __nfp_nsp_hwinfo_lookup(state, buf, size, false);
 	if (err)
 		return err;
 
@@ -926,6 +933,43 @@ int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size)
 	return 0;
 }
 
+int nfp_nsp_hwinfo_lookup_optional(struct nfp_nsp *state, void *buf,
+				   unsigned int size, const char *default_val)
+{
+	int err;
+
+	/* Ensure that the default value is usable irrespective of whether
+	 * it is actually going to be used.
+	 */
+	if (strnlen(default_val, size) == size)
+		return -EINVAL;
+
+	if (!nfp_nsp_has_hwinfo_lookup(state)) {
+		strcpy(buf, default_val);
+		return 0;
+	}
+
+	size = min_t(u32, size, NFP_HWINFO_LOOKUP_SIZE);
+
+	err = __nfp_nsp_hwinfo_lookup(state, buf, size, true);
+	if (err) {
+		if (err == -ENOENT) {
+			strcpy(buf, default_val);
+			return 0;
+		}
+
+		nfp_err(state->cpp, "NSP HWinfo lookup failed: %d\n", err);
+		return err;
+	}
+
+	if (strnlen(buf, size) == size) {
+		nfp_err(state->cpp, "NSP HWinfo value not NULL-terminated\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 int nfp_nsp_fw_loaded(struct nfp_nsp *state)
 {
 	const struct nfp_nsp_command_arg arg = {
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
index 4ceecff347c6..a8985c2eb1f1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
@@ -22,6 +22,8 @@ int nfp_nsp_write_flash(struct nfp_nsp *state, const struct firmware *fw);
 int nfp_nsp_mac_reinit(struct nfp_nsp *state);
 int nfp_nsp_load_stored_fw(struct nfp_nsp *state);
 int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size);
+int nfp_nsp_hwinfo_lookup_optional(struct nfp_nsp *state, void *buf,
+				   unsigned int size, const char *default_val);
 int nfp_nsp_fw_loaded(struct nfp_nsp *state);
 int nfp_nsp_read_module_eeprom(struct nfp_nsp *state, int eth_index,
 			       unsigned int offset, void *data,
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 05/11] nfp: nsp: add support for hwinfo set operation
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add support for the NSP HWinfo set command. This closely follows the
HWinfo lookup command.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 15 +++++++++++++++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h |  6 ++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index deee91cbf1b2..f18e787fa9ad 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -96,6 +96,7 @@ enum nfp_nsp_cmd {
 	SPCODE_NSP_IDENTIFY	= 13, /* Read NSP version */
 	SPCODE_FW_STORED	= 16, /* If no FW loaded, load flash app FW */
 	SPCODE_HWINFO_LOOKUP	= 17, /* Lookup HWinfo with overwrites etc. */
+	SPCODE_HWINFO_SET	= 18, /* Set HWinfo entry */
 	SPCODE_FW_LOADED	= 19, /* Is application firmware loaded */
 	SPCODE_VERSIONS		= 21, /* Report FW versions */
 	SPCODE_READ_SFF_EEPROM	= 22, /* Read module EEPROM */
@@ -970,6 +971,20 @@ int nfp_nsp_hwinfo_lookup_optional(struct nfp_nsp *state, void *buf,
 	return 0;
 }
 
+int nfp_nsp_hwinfo_set(struct nfp_nsp *state, void *buf, unsigned int size)
+{
+	struct nfp_nsp_command_buf_arg hwinfo_set = {
+		{
+			.code		= SPCODE_HWINFO_SET,
+			.option		= size,
+		},
+		.in_buf		= buf,
+		.in_size	= size,
+	};
+
+	return nfp_nsp_command_buf(state, &hwinfo_set);
+}
+
 int nfp_nsp_fw_loaded(struct nfp_nsp *state)
 {
 	const struct nfp_nsp_command_arg arg = {
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
index a8985c2eb1f1..055fda05880d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
@@ -24,6 +24,7 @@ int nfp_nsp_load_stored_fw(struct nfp_nsp *state);
 int nfp_nsp_hwinfo_lookup(struct nfp_nsp *state, void *buf, unsigned int size);
 int nfp_nsp_hwinfo_lookup_optional(struct nfp_nsp *state, void *buf,
 				   unsigned int size, const char *default_val);
+int nfp_nsp_hwinfo_set(struct nfp_nsp *state, void *buf, unsigned int size);
 int nfp_nsp_fw_loaded(struct nfp_nsp *state);
 int nfp_nsp_read_module_eeprom(struct nfp_nsp *state, int eth_index,
 			       unsigned int offset, void *data,
@@ -44,6 +45,11 @@ static inline bool nfp_nsp_has_hwinfo_lookup(struct nfp_nsp *state)
 	return nfp_nsp_get_abi_ver_minor(state) > 24;
 }
 
+static inline bool nfp_nsp_has_hwinfo_set(struct nfp_nsp *state)
+{
+	return nfp_nsp_get_abi_ver_minor(state) > 25;
+}
+
 static inline bool nfp_nsp_has_fw_loaded(struct nfp_nsp *state)
 {
 	return nfp_nsp_get_abi_ver_minor(state) > 25;
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 06/11] nfp: honor FW reset and loading policies
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

The firmware reset and loading policies can be controlled with the
combination of three hwinfo keys, 'abi_drv_reset', 'abi_drv_load_ifc'
and 'app_fw_from_flash'.

'app_fw_from_flash' defines which firmware should take precedence,
'Disk', 'Flash' or the 'Preferred' firmware. When 'Preferred'
is selected, the management firmware makes the decision on which
firmware will be loaded by comparing versions of the flash firmware
and the host supplied firmware.

'abi_drv_reset' defines when the driver should reset the firmware when
the driver is probed, either 'Disk' if firmware was found on disk,
'Always' reset or 'Never' reset. Note that the device is always reset
on driver unload if firmware was loaded when the driver was probed.

'abi_drv_load_ifc' defines a list of PF devices allowed to load FW on
the device.

Furthermore, we limit the cases to where the driver will unload firmware
again when the driver is removed to only when firmware was loaded by the
driver and only if this particular device was the only one that could
have loaded firmware. This is needed to avoid firmware being removed
while in use on multi-host platforms.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c      | 140 +++++++++++++++++----
 drivers/net/ethernet/netronome/nfp/nfp_main.h      |   2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h   |  15 +++
 3 files changed, 132 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 81679647e842..969850f8fe51 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -352,7 +352,7 @@ nfp_net_fw_request(struct pci_dev *pdev, struct nfp_pf *pf, const char *name)
 
 	err = request_firmware_direct(&fw, name, &pdev->dev);
 	nfp_info(pf->cpp, "  %s: %s\n",
-		 name, err ? "not found" : "found, loading...");
+		 name, err ? "not found" : "found");
 	if (err)
 		return NULL;
 
@@ -430,6 +430,33 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf)
 	return nfp_net_fw_request(pdev, pf, fw_name);
 }
 
+static int
+nfp_get_fw_policy_value(struct pci_dev *pdev, struct nfp_nsp *nsp,
+			const char *key, const char *default_val, int max_val,
+			int *value)
+{
+	char hwinfo[64];
+	long hi_val;
+	int err;
+
+	snprintf(hwinfo, sizeof(hwinfo), key);
+	err = nfp_nsp_hwinfo_lookup_optional(nsp, hwinfo, sizeof(hwinfo),
+					     default_val);
+	if (err)
+		return err;
+
+	err = kstrtol(hwinfo, 0, &hi_val);
+	if (err || hi_val < 0 || hi_val > max_val) {
+		dev_warn(&pdev->dev,
+			 "Invalid value '%s' from '%s', ignoring\n",
+			 hwinfo, key);
+		err = kstrtol(default_val, 0, &hi_val);
+	}
+
+	*value = hi_val;
+	return err;
+}
+
 /**
  * nfp_net_fw_load() - Load the firmware image
  * @pdev:       PCI Device structure
@@ -441,44 +468,107 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf)
 static int
 nfp_fw_load(struct pci_dev *pdev, struct nfp_pf *pf, struct nfp_nsp *nsp)
 {
-	const struct firmware *fw;
+	bool do_reset, fw_loaded = false;
+	const struct firmware *fw = NULL;
+	int err, reset, policy, ifcs = 0;
+	char *token, *ptr;
+	char hwinfo[64];
 	u16 interface;
-	int err;
+
+	snprintf(hwinfo, sizeof(hwinfo), "abi_drv_load_ifc");
+	err = nfp_nsp_hwinfo_lookup_optional(nsp, hwinfo, sizeof(hwinfo),
+					     NFP_NSP_DRV_LOAD_IFC_DEFAULT);
+	if (err)
+		return err;
 
 	interface = nfp_cpp_interface(pf->cpp);
-	if (NFP_CPP_INTERFACE_UNIT_of(interface) != 0) {
-		/* Only Unit 0 should reset or load firmware */
+	ptr = hwinfo;
+	while ((token = strsep(&ptr, ","))) {
+		unsigned long interface_hi;
+
+		err = kstrtoul(token, 0, &interface_hi);
+		if (err) {
+			dev_err(&pdev->dev,
+				"Failed to parse interface '%s': %d\n",
+				token, err);
+			return err;
+		}
+
+		ifcs++;
+		if (interface == interface_hi)
+			break;
+	}
+
+	if (!token) {
 		dev_info(&pdev->dev, "Firmware will be loaded by partner\n");
 		return 0;
 	}
 
+	err = nfp_get_fw_policy_value(pdev, nsp, "abi_drv_reset",
+				      NFP_NSP_DRV_RESET_DEFAULT,
+				      NFP_NSP_DRV_RESET_NEVER, &reset);
+	if (err)
+		return err;
+
+	err = nfp_get_fw_policy_value(pdev, nsp, "app_fw_from_flash",
+				      NFP_NSP_APP_FW_LOAD_DEFAULT,
+				      NFP_NSP_APP_FW_LOAD_PREF, &policy);
+	if (err)
+		return err;
+
 	fw = nfp_net_fw_find(pdev, pf);
-	if (!fw) {
-		if (nfp_nsp_has_stored_fw_load(nsp))
-			nfp_nsp_load_stored_fw(nsp);
-		return 0;
+	do_reset = reset == NFP_NSP_DRV_RESET_ALWAYS ||
+		   (fw && reset == NFP_NSP_DRV_RESET_DISK);
+
+	if (do_reset) {
+		dev_info(&pdev->dev, "Soft-resetting the NFP\n");
+		err = nfp_nsp_device_soft_reset(nsp);
+		if (err < 0) {
+			dev_err(&pdev->dev,
+				"Failed to soft reset the NFP: %d\n", err);
+			goto exit_release_fw;
+		}
 	}
 
-	dev_info(&pdev->dev, "Soft-reset, loading FW image\n");
-	err = nfp_nsp_device_soft_reset(nsp);
-	if (err < 0) {
-		dev_err(&pdev->dev, "Failed to soft reset the NFP: %d\n",
-			err);
-		goto exit_release_fw;
-	}
+	if (fw && policy != NFP_NSP_APP_FW_LOAD_FLASH) {
+		if (nfp_nsp_has_fw_loaded(nsp) && nfp_nsp_fw_loaded(nsp))
+			goto exit_release_fw;
 
-	err = nfp_nsp_load_fw(nsp, fw);
-	if (err < 0) {
-		dev_err(&pdev->dev, "FW loading failed: %d\n", err);
-		goto exit_release_fw;
+		err = nfp_nsp_load_fw(nsp, fw);
+		if (err < 0) {
+			dev_err(&pdev->dev, "FW loading failed: %d\n",
+				err);
+			goto exit_release_fw;
+		}
+		dev_info(&pdev->dev, "Finished loading FW image\n");
+		fw_loaded = true;
+	} else if (policy != NFP_NSP_APP_FW_LOAD_DISK &&
+		   nfp_nsp_has_stored_fw_load(nsp)) {
+
+		/* Don't propagate this error to stick with legacy driver
+		 * behavior, failure will be detected later during init.
+		 */
+		if (!nfp_nsp_load_stored_fw(nsp))
+			dev_info(&pdev->dev, "Finished loading stored FW image\n");
+
+		/* Don't flag the fw_loaded in this case since other devices
+		 * may reuse the firmware when configured this way
+		 */
+	} else {
+		dev_warn(&pdev->dev, "Didn't load firmware, please update flash or reconfigure card\n");
 	}
 
-	dev_info(&pdev->dev, "Finished loading FW image\n");
-
 exit_release_fw:
 	release_firmware(fw);
 
-	return err < 0 ? err : 1;
+	/* We don't want to unload firmware when other devices may still be
+	 * dependent on it, which could be the case if there are multiple
+	 * devices that could load firmware.
+	 */
+	if (fw_loaded && ifcs == 1)
+		pf->unload_fw_on_remove = true;
+
+	return err < 0 ? err : fw_loaded;
 }
 
 static void
@@ -704,7 +794,7 @@ static int nfp_pci_probe(struct pci_dev *pdev,
 err_fw_unload:
 	kfree(pf->rtbl);
 	nfp_mip_close(pf->mip);
-	if (pf->fw_loaded)
+	if (pf->unload_fw_on_remove)
 		nfp_fw_unload(pf);
 	kfree(pf->eth_tbl);
 	kfree(pf->nspi);
@@ -744,7 +834,7 @@ static void __nfp_pci_shutdown(struct pci_dev *pdev, bool unload_fw)
 	vfree(pf->dumpspec);
 	kfree(pf->rtbl);
 	nfp_mip_close(pf->mip);
-	if (unload_fw && pf->fw_loaded)
+	if (unload_fw && pf->unload_fw_on_remove)
 		nfp_fw_unload(pf);
 
 	destroy_workqueue(pf->wq);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.h b/drivers/net/ethernet/netronome/nfp/nfp_main.h
index b7211f200d22..bd6450b0f23f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.h
@@ -64,6 +64,7 @@ struct nfp_dumpspec {
  * @limit_vfs:		Number of VFs supported by firmware (~0 for PCI limit)
  * @num_vfs:		Number of SR-IOV VFs enabled
  * @fw_loaded:		Is the firmware loaded?
+ * @unload_fw_on_remove:Do we need to unload firmware on driver removal?
  * @ctrl_vnic:		Pointer to the control vNIC if available
  * @mip:		MIP handle
  * @rtbl:		RTsym table
@@ -110,6 +111,7 @@ struct nfp_pf {
 	unsigned int num_vfs;
 
 	bool fw_loaded;
+	bool unload_fw_on_remove;
 
 	struct nfp_net *ctrl_vnic;
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
index 055fda05880d..1531c1870020 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h
@@ -102,6 +102,21 @@ enum nfp_eth_fec {
 #define NFP_FEC_REED_SOLOMON	BIT(NFP_FEC_REED_SOLOMON_BIT)
 #define NFP_FEC_DISABLED	BIT(NFP_FEC_DISABLED_BIT)
 
+/* Defines the valid values of the 'abi_drv_reset' hwinfo key */
+#define NFP_NSP_DRV_RESET_DISK			0
+#define NFP_NSP_DRV_RESET_ALWAYS		1
+#define NFP_NSP_DRV_RESET_NEVER			2
+#define NFP_NSP_DRV_RESET_DEFAULT		"0"
+
+/* Defines the valid values of the 'app_fw_from_flash' hwinfo key */
+#define NFP_NSP_APP_FW_LOAD_DISK		0
+#define NFP_NSP_APP_FW_LOAD_FLASH		1
+#define NFP_NSP_APP_FW_LOAD_PREF		2
+#define NFP_NSP_APP_FW_LOAD_DEFAULT		"2"
+
+/* Define the default value for the 'abi_drv_load_ifc' key */
+#define NFP_NSP_DRV_LOAD_IFC_DEFAULT		"0x10ff"
+
 /**
  * struct nfp_eth_table - ETH table information
  * @count:	number of table entries
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 07/11] nfp: add devlink param infrastructure
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Register devlink parameters for driver use. Subsequent patches will add
support for specific parameters.

In order to support devlink parameters, the management firmware needs to
be able to lookup and set hwinfo keys.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/Makefile        |  1 +
 drivers/net/ethernet/netronome/nfp/devlink_param.c | 60 ++++++++++++++++++++++
 drivers/net/ethernet/netronome/nfp/nfp_main.h      |  3 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  |  7 +++
 4 files changed, 71 insertions(+)
 create mode 100644 drivers/net/ethernet/netronome/nfp/devlink_param.c

diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile
index 2805641965f3..d31772ae511d 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -17,6 +17,7 @@ nfp-objs := \
 	    nfpcore/nfp_target.o \
 	    ccm.o \
 	    ccm_mbox.o \
+	    devlink_param.o \
 	    nfp_asm.o \
 	    nfp_app.o \
 	    nfp_app_nic.o \
diff --git a/drivers/net/ethernet/netronome/nfp/devlink_param.c b/drivers/net/ethernet/netronome/nfp/devlink_param.c
new file mode 100644
index 000000000000..aece98586e32
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/devlink_param.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+/* Copyright (C) 2019 Netronome Systems, Inc. */
+
+#include <net/devlink.h>
+
+#include "nfpcore/nfp_nsp.h"
+#include "nfp_main.h"
+
+static const struct devlink_param nfp_devlink_params[] = {
+};
+
+static int nfp_devlink_supports_params(struct nfp_pf *pf)
+{
+	struct nfp_nsp *nsp;
+	bool supported;
+	int err;
+
+	nsp = nfp_nsp_open(pf->cpp);
+	if (IS_ERR(nsp)) {
+		err = PTR_ERR(nsp);
+		dev_err(&pf->pdev->dev, "Failed to access the NSP: %d\n", err);
+		return err;
+	}
+
+	supported = nfp_nsp_has_hwinfo_lookup(nsp) &&
+		    nfp_nsp_has_hwinfo_set(nsp);
+
+	nfp_nsp_close(nsp);
+	return supported;
+}
+
+int nfp_devlink_params_register(struct nfp_pf *pf)
+{
+	struct devlink *devlink = priv_to_devlink(pf);
+	int err;
+
+	err = nfp_devlink_supports_params(pf);
+	if (err <= 0)
+		return err;
+
+	err = devlink_params_register(devlink, nfp_devlink_params,
+				      ARRAY_SIZE(nfp_devlink_params));
+	if (err)
+		return err;
+
+	devlink_params_publish(devlink);
+	return 0;
+}
+
+void nfp_devlink_params_unregister(struct nfp_pf *pf)
+{
+	int err;
+
+	err = nfp_devlink_supports_params(pf);
+	if (err <= 0)
+		return;
+
+	devlink_params_unregister(priv_to_devlink(pf), nfp_devlink_params,
+				  ARRAY_SIZE(nfp_devlink_params));
+}
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.h b/drivers/net/ethernet/netronome/nfp/nfp_main.h
index bd6450b0f23f..5d5812fd9317 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.h
@@ -187,4 +187,7 @@ int nfp_shared_buf_pool_get(struct nfp_pf *pf, unsigned int sb, u16 pool_index,
 int nfp_shared_buf_pool_set(struct nfp_pf *pf, unsigned int sb,
 			    u16 pool_index, u32 size,
 			    enum devlink_sb_threshold_type threshold_type);
+
+int nfp_devlink_params_register(struct nfp_pf *pf);
+void nfp_devlink_params_unregister(struct nfp_pf *pf);
 #endif /* NFP_MAIN_H */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index 986464d4a206..47addac104fe 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -711,6 +711,10 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 	if (err)
 		goto err_devlink_unreg;
 
+	err = nfp_devlink_params_register(pf);
+	if (err)
+		goto err_shared_buf_unreg;
+
 	mutex_lock(&pf->lock);
 	pf->ddir = nfp_net_debugfs_device_add(pf->pdev);
 
@@ -744,6 +748,8 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 err_clean_ddir:
 	nfp_net_debugfs_dir_clean(&pf->ddir);
 	mutex_unlock(&pf->lock);
+	nfp_devlink_params_unregister(pf);
+err_shared_buf_unreg:
 	nfp_shared_buf_unregister(pf);
 err_devlink_unreg:
 	cancel_work_sync(&pf->port_refresh_work);
@@ -773,6 +779,7 @@ void nfp_net_pci_remove(struct nfp_pf *pf)
 
 	mutex_unlock(&pf->lock);
 
+	nfp_devlink_params_unregister(pf);
 	nfp_shared_buf_unregister(pf);
 	devlink_unregister(priv_to_devlink(pf));
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 08/11] nfp: devlink: add 'fw_load_policy' support
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add support for the 'fw_load_policy' devlink parameter. The FW load
policy is controlled by the 'app_fw_from_flash' hwinfo key.

Remap the values from devlink to the hwinfo key and back.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params-nfp.txt    |   2 +
 drivers/net/ethernet/netronome/nfp/devlink_param.c | 165 +++++++++++++++++++++
 2 files changed, 167 insertions(+)
 create mode 100644 Documentation/networking/devlink-params-nfp.txt

diff --git a/Documentation/networking/devlink-params-nfp.txt b/Documentation/networking/devlink-params-nfp.txt
new file mode 100644
index 000000000000..85b1e36f73a8
--- /dev/null
+++ b/Documentation/networking/devlink-params-nfp.txt
@@ -0,0 +1,2 @@
+fw_load_policy		[DEVICE, GENERIC]
+			Configuration mode: permanent
diff --git a/drivers/net/ethernet/netronome/nfp/devlink_param.c b/drivers/net/ethernet/netronome/nfp/devlink_param.c
index aece98586e32..d9c74cfba560 100644
--- a/drivers/net/ethernet/netronome/nfp/devlink_param.c
+++ b/drivers/net/ethernet/netronome/nfp/devlink_param.c
@@ -3,10 +3,175 @@
 
 #include <net/devlink.h>
 
+#include "nfpcore/nfp.h"
 #include "nfpcore/nfp_nsp.h"
 #include "nfp_main.h"
 
+/**
+ * struct nfp_devlink_param_u8_arg - Devlink u8 parameter get/set arguments
+ * @hwinfo_name:	HWinfo key name
+ * @default_hi_val:	Default HWinfo value if HWinfo doesn't exist
+ * @invalid_dl_val:	Devlink value to use if HWinfo is unknown/invalid.
+ *			-errno if there is no unknown/invalid value available
+ * @hi_to_dl:	HWinfo to devlink value mapping
+ * @dl_to_hi:	Devlink to hwinfo value mapping
+ * @max_dl_val:	Maximum devlink value supported, for validation only
+ * @max_hi_val:	Maximum HWinfo value supported, for validation only
+ */
+struct nfp_devlink_param_u8_arg {
+	const char *hwinfo_name;
+	const char *default_hi_val;
+	int invalid_dl_val;
+	u8 hi_to_dl[4];
+	u8 dl_to_hi[4];
+	u8 max_dl_val;
+	u8 max_hi_val;
+};
+
+static const struct nfp_devlink_param_u8_arg nfp_devlink_u8_args[] = {
+	[DEVLINK_PARAM_GENERIC_ID_FW_LOAD_POLICY] = {
+		.hwinfo_name = "app_fw_from_flash",
+		.default_hi_val = NFP_NSP_APP_FW_LOAD_DEFAULT,
+		.invalid_dl_val = -EINVAL,
+		.hi_to_dl = {
+			[NFP_NSP_APP_FW_LOAD_DISK] =
+				DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
+			[NFP_NSP_APP_FW_LOAD_FLASH] =
+				DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH,
+			[NFP_NSP_APP_FW_LOAD_PREF] =
+				DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER,
+		},
+		.dl_to_hi = {
+			[DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER] =
+				NFP_NSP_APP_FW_LOAD_PREF,
+			[DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH] =
+				NFP_NSP_APP_FW_LOAD_FLASH,
+			[DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK] =
+				NFP_NSP_APP_FW_LOAD_DISK,
+		},
+		.max_dl_val = DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
+		.max_hi_val = NFP_NSP_APP_FW_LOAD_PREF,
+	},
+};
+
+static int
+nfp_devlink_param_u8_get(struct devlink *devlink, u32 id,
+			 struct devlink_param_gset_ctx *ctx)
+{
+	const struct nfp_devlink_param_u8_arg *arg;
+	struct nfp_pf *pf = devlink_priv(devlink);
+	struct nfp_nsp *nsp;
+	char hwinfo[32];
+	long value;
+	int err;
+
+	if (id >= ARRAY_SIZE(nfp_devlink_u8_args))
+		return -EOPNOTSUPP;
+
+	arg = &nfp_devlink_u8_args[id];
+
+	nsp = nfp_nsp_open(pf->cpp);
+	if (IS_ERR(nsp)) {
+		err = PTR_ERR(nsp);
+		nfp_warn(pf->cpp, "can't access NSP: %d\n", err);
+		return err;
+	}
+
+	snprintf(hwinfo, sizeof(hwinfo), arg->hwinfo_name);
+	err = nfp_nsp_hwinfo_lookup_optional(nsp, hwinfo, sizeof(hwinfo),
+					     arg->default_hi_val);
+	if (err) {
+		nfp_warn(pf->cpp, "HWinfo lookup failed: %d\n", err);
+		goto exit_close_nsp;
+	}
+
+	err = kstrtol(hwinfo, 0, &value);
+	if (err || value < 0 || value > arg->max_hi_val) {
+		nfp_warn(pf->cpp, "HWinfo '%s' value %li invalid\n",
+			 arg->hwinfo_name, value);
+
+		if (arg->invalid_dl_val >= 0)
+			ctx->val.vu8 = arg->invalid_dl_val;
+		else
+			err = arg->invalid_dl_val;
+
+		goto exit_close_nsp;
+	}
+
+	ctx->val.vu8 = arg->hi_to_dl[value];
+
+exit_close_nsp:
+	nfp_nsp_close(nsp);
+	return err;
+}
+
+static int
+nfp_devlink_param_u8_set(struct devlink *devlink, u32 id,
+			 struct devlink_param_gset_ctx *ctx)
+{
+	const struct nfp_devlink_param_u8_arg *arg;
+	struct nfp_pf *pf = devlink_priv(devlink);
+	struct nfp_nsp *nsp;
+	char hwinfo[32];
+	int err;
+
+	if (id >= ARRAY_SIZE(nfp_devlink_u8_args))
+		return -EOPNOTSUPP;
+
+	arg = &nfp_devlink_u8_args[id];
+
+	nsp = nfp_nsp_open(pf->cpp);
+	if (IS_ERR(nsp)) {
+		err = PTR_ERR(nsp);
+		nfp_warn(pf->cpp, "can't access NSP: %d\n", err);
+		return err;
+	}
+
+	/* Note the value has already been validated. */
+	snprintf(hwinfo, sizeof(hwinfo), "%s=%u",
+		 arg->hwinfo_name, arg->dl_to_hi[ctx->val.vu8]);
+	err = nfp_nsp_hwinfo_set(nsp, hwinfo, sizeof(hwinfo));
+	if (err) {
+		nfp_warn(pf->cpp, "HWinfo set failed: %d\n", err);
+		goto exit_close_nsp;
+	}
+
+exit_close_nsp:
+	nfp_nsp_close(nsp);
+	return err;
+}
+
+static int
+nfp_devlink_param_u8_validate(struct devlink *devlink, u32 id,
+			      union devlink_param_value val,
+			      struct netlink_ext_ack *extack)
+{
+	const struct nfp_devlink_param_u8_arg *arg;
+
+	if (id >= ARRAY_SIZE(nfp_devlink_u8_args))
+		return -EOPNOTSUPP;
+
+	arg = &nfp_devlink_u8_args[id];
+
+	if (val.vu8 > arg->max_dl_val) {
+		NL_SET_ERR_MSG_MOD(extack, "parameter out of range");
+		return -EINVAL;
+	}
+
+	if (val.vu8 == arg->invalid_dl_val) {
+		NL_SET_ERR_MSG_MOD(extack, "unknown/invalid value specified");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static const struct devlink_param nfp_devlink_params[] = {
+	DEVLINK_PARAM_GENERIC(FW_LOAD_POLICY,
+			      BIT(DEVLINK_PARAM_CMODE_PERMANENT),
+			      nfp_devlink_param_u8_get,
+			      nfp_devlink_param_u8_set,
+			      nfp_devlink_param_u8_validate),
 };
 
 static int nfp_devlink_supports_params(struct nfp_pf *pf)
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2 09/11] nfp: devlink: add 'reset_dev_on_drv_probe' support
From: Simon Horman @ 2019-09-08 23:54 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190908235427.9757-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add support for the 'reset_dev_on_drv_probe' devlink parameter. The
reset control policy is controlled by the 'abi_drv_reset' hwinfo key.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params-nfp.txt    |  3 +++
 drivers/net/ethernet/netronome/nfp/devlink_param.c | 29 ++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/Documentation/networking/devlink-params-nfp.txt b/Documentation/networking/devlink-params-nfp.txt
index 85b1e36f73a8..43e4d4034865 100644
--- a/Documentation/networking/devlink-params-nfp.txt
+++ b/Documentation/networking/devlink-params-nfp.txt
@@ -1,2 +1,5 @@
 fw_load_policy		[DEVICE, GENERIC]
 			Configuration mode: permanent
+
+reset_dev_on_drv_probe	[DEVICE, GENERIC]
+			Configuration mode: permanent
diff --git a/drivers/net/ethernet/netronome/nfp/devlink_param.c b/drivers/net/ethernet/netronome/nfp/devlink_param.c
index d9c74cfba560..4a8141b4d625 100644
--- a/drivers/net/ethernet/netronome/nfp/devlink_param.c
+++ b/drivers/net/ethernet/netronome/nfp/devlink_param.c
@@ -52,6 +52,30 @@ static const struct nfp_devlink_param_u8_arg nfp_devlink_u8_args[] = {
 		.max_dl_val = DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
 		.max_hi_val = NFP_NSP_APP_FW_LOAD_PREF,
 	},
+	[DEVLINK_PARAM_GENERIC_ID_RESET_DEV_ON_DRV_PROBE] = {
+		.hwinfo_name = "abi_drv_reset",
+		.default_hi_val = NFP_NSP_DRV_RESET_DEFAULT,
+		.invalid_dl_val =
+			DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN,
+		.hi_to_dl = {
+			[NFP_NSP_DRV_RESET_ALWAYS] =
+				DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS,
+			[NFP_NSP_DRV_RESET_NEVER] =
+				DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER,
+			[NFP_NSP_DRV_RESET_DISK] =
+				DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK,
+		},
+		.dl_to_hi = {
+			[DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS] =
+				NFP_NSP_DRV_RESET_ALWAYS,
+			[DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER] =
+				NFP_NSP_DRV_RESET_NEVER,
+			[DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK] =
+				NFP_NSP_DRV_RESET_DISK,
+		},
+		.max_dl_val = DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK,
+		.max_hi_val = NFP_NSP_DRV_RESET_NEVER,
+	}
 };
 
 static int
@@ -172,6 +196,11 @@ static const struct devlink_param nfp_devlink_params[] = {
 			      nfp_devlink_param_u8_get,
 			      nfp_devlink_param_u8_set,
 			      nfp_devlink_param_u8_validate),
+	DEVLINK_PARAM_GENERIC(RESET_DEV_ON_DRV_PROBE,
+			      BIT(DEVLINK_PARAM_CMODE_PERMANENT),
+			      nfp_devlink_param_u8_get,
+			      nfp_devlink_param_u8_set,
+			      nfp_devlink_param_u8_validate),
 };
 
 static int nfp_devlink_supports_params(struct nfp_pf *pf)
-- 
2.11.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox