Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH V6 net-next 08/15] net/tls: Support TLS device offload with IPv6
From: David Miller @ 2018-04-24  0:56 UTC (permalink / raw)
  To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal
In-Reply-To: <1524410397-108425-9-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:50 +0300

> @@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>  	spin_unlock_irqrestore(&tls_device_lock, flags);
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static struct net_device *ipv6_get_netdev(struct sock *sk)
> +{
> +	struct net_device *dev = NULL;
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct ipv6_pinfo *np = inet6_sk(sk);
> +	struct flowi6 _fl6, *fl6 = &_fl6;
> +	struct dst_entry *dst;

Ugh, please use sk->sk_dst_cache->dev and avoid all of the unnecessary
work.

Thank you.

^ permalink raw reply

* Re: [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure
From: David Miller @ 2018-04-24  0:55 UTC (permalink / raw)
  To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal, aviadye
In-Reply-To: <1524410397-108425-8-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:49 +0300

> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;
> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}

Please do this more directly by looking at sk->sk_dst_cache and taking
the device from dst->dev if non-NULL.

That avoids the dev_get_by_index() demux work, and also if
sk->sk_dst_cache is non-NULL then the socket is indeed connected.

Thanks.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Jason Wang @ 2018-04-24  0:54 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423092908.77rii3gi7dcaf7o6@debian>



On 2018年04月23日 17:29, Tiwei Bie wrote:
> On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
>> On 2018年04月01日 22:12, Tiwei Bie wrote:
>>> Hello everyone,
>>>
>>> This RFC implements packed ring support for virtio driver.
>>>
>>> The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
>>> by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
>>> Minor changes are needed for the vhost code, e.g. to kick the guest.
>>>
>>> TODO:
>>> - Refinements and bug fixes;
>>> - Split into small patches;
>>> - Test indirect descriptor support;
>>> - Test/fix event suppression support;
>>> - Test devices other than net;
>>>
>>> RFC v1 -> RFC v2:
>>> - Add indirect descriptor support - compile test only;
>>> - Add event suppression supprt - compile test only;
>>> - Move vring_packed_init() out of uapi (Jason, MST);
>>> - Merge two loops into one in virtqueue_add_packed() (Jason);
>>> - Split vring_unmap_one() for packed ring and split ring (Jason);
>>> - Avoid using '%' operator (Jason);
>>> - Rename free_head -> next_avail_idx (Jason);
>>> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
>>> - Some other refinements and bug fixes;
>>>
>>> Thanks!
>>>
>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>> ---
>>>    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
>>>    include/linux/virtio_ring.h        |    8 +-
>>>    include/uapi/linux/virtio_config.h |   12 +-
>>>    include/uapi/linux/virtio_ring.h   |   61 ++
>>>    4 files changed, 980 insertions(+), 195 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 71458f493cf8..0515dca34d77 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -58,14 +58,15 @@
>> [...]
>>
>>> +
>>> +	if (vq->indirect) {
>>> +		u32 len;
>>> +
>>> +		desc = vq->desc_state[head].indir_desc;
>>> +		/* Free the indirect table, if any, now that it's unmapped. */
>>> +		if (!desc)
>>> +			goto out;
>>> +
>>> +		len = virtio32_to_cpu(vq->vq.vdev,
>>> +				      vq->vring_packed.desc[head].len);
>>> +
>>> +		BUG_ON(!(vq->vring_packed.desc[head].flags &
>>> +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
>> can safely remove this BUG_ON() here.
>>
>>> +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
>> Len could be ignored for used descriptor according to the spec, so we need
>> remove this BUG_ON() too.
> Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> And I think something related to this in the spec isn't very
> clear currently.
>
> In the spec, there are below words:
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> """
> In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> is reserved and is ignored by the device.
> """
>
> So when device writes back an used descriptor in this case,
> device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> is reserved and should be ignored.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> """
> Element Length is reserved for used descriptors without the
> VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> """
>
> And this is the way how driver ignores the `len` in an used
> descriptor.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> """
> To increase ring capacity the driver can store a (read-only
> by the device) table of indirect descriptors anywhere in memory,
> and insert a descriptor in the main virtqueue (with \field{Flags}
> bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> containing this indirect descriptor table;
> """
>
> So the indirect descriptors in the table are read-only by
> the device. And the only descriptor which is writeable by
> the device is the descriptor in the main virtqueue (with
> Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> `len` in this descriptor, we won't be able to get the
> length of the data written by the device.
>
> So I think the `len` in this descriptor will carry the
> length of the data written by the device (if the buffers
> are writable to the device) even if the VIRTQ_DESC_F_WRITE
> isn't set by the device. How do you think?

Yes I think so. But we'd better need clarification from Michael.

>
>
>> The reason is we don't touch descriptor ring in the case of split, so
>> BUG_ON()s may help there.
>>
>>> +
>>> +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
>>> +			vring_unmap_one_packed(vq, &desc[j]);
>>> +
>>> +		kfree(desc);
>>> +		vq->desc_state[head].indir_desc = NULL;
>>> +	} else if (ctx) {
>>> +		*ctx = vq->desc_state[head].indir_desc;
>>> +	}
>>> +
>>> +out:
>>> +	return vq->desc_state[head].num;
>>> +}
>>> +
>>> +static inline bool more_used_split(const struct vring_virtqueue *vq)
>>>    {
>>>    	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
>>>    }
>>> +static inline bool more_used_packed(const struct vring_virtqueue *vq)
>>> +{
>>> +	u16 last_used, flags;
>>> +	bool avail, used;
>>> +
>>> +	if (vq->vq.num_free == vq->vring_packed.num)
>>> +		return false;
>>> +
>>> +	last_used = vq->last_used_idx;
>>> +	flags = virtio16_to_cpu(vq->vq.vdev,
>>> +				vq->vring_packed.desc[last_used].flags);
>>> +	avail = flags & VRING_DESC_F_AVAIL(1);
>>> +	used = flags & VRING_DESC_F_USED(1);
>>> +
>>> +	return avail == used;
>>> +}
>> This looks interesting, spec said:
>>
>> "
>> Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
>> available descriptor and
>> equal for a used descriptor.
>> Note that this observation is mostly useful for sanity-checking as these are
>> necessary but not sufficient
>> conditions - for example, all descriptors are zero-initialized. To detect
>> used and available descriptors it is
>> possible for drivers and devices to keep track of the last observed value of
>> VIRTQ_DESC_F_USED/VIRTQ_-
>> DESC_F_AVAIL. Other techniques to detect
>> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
>> might also be possible.
>> "
>>
>> So it looks to me it was not sufficient, looking at the example codes in
>> spec, do we need to track last seen used_wrap_counter here?
> I don't think we have to track used_wrap_counter in
> driver. There was a discussion on this:
>
> https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
>
> And after that, below sentence was added (it's also
> in the above words you quoted):
>
> """
> Other techniques to detect
> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> might also be possible.
> """
>
> Best regards,
> Tiwei Bie

I see, the extra condition "if (vq->vq.num_free == 
vq->vring_packed.num)" help in this case.

Thanks

>
>> Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24  0:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180423151545.GU17484@dhcp22.suse.cz>



On Mon, 23 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> 
> > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > 
> > > I don't remember that conversation, so I don't know whether I agree with
> > > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > > towards marking regions with memalloc_noio_save() / restore.  If you do
> > > that, you won't need vmalloc(GFP_NOIO).
> > 
> > He said the same thing a year ago. And there was small progress. 6 out of 
> > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> > infiniband and 1 in btrfs. (the whole discussion is here 
> > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> 
> Well this is not that easy. It requires a cooperation from maintainers.
> I can only do as much. I've posted patches in the past and actively
> bringing up this topic at LSFMM last two years...

You're right - but you have chosen the uneasy path. Fixing __vmalloc code 
is easy and it doesn't require cooperation with maintainers.

> > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > does he have veto over this part of the code? I'd much rather argue with 
> > people who have constructive comments about fixing bugs than with him.
> 
> I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> I would be much more willing to change my mind if you would back your
> patch by a real bug report. Hacks are acceptable when we have a real
> issue in hands. But if we want to fix potential issue then better make
> it properly.

Developers should fix bugs in advance, not to wait until a crash hapens, 
is analyzed and reported.

What's the problem with 15-line hack? Is the problem that kernel 
developers would feel depressed when looking the source code? Other than 
harming developers' feelings, I don't see what kind of damange could that 
piece of code do.

Mikulas

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Yonghong Song @ 2018-04-24  0:06 UTC (permalink / raw)
  To: Peter Zijlstra, Song Liu
  Cc: netdev, ast, daniel, kernel-team, hannes, qinteng
In-Reply-To: <20180312220132.GH4043@hirez.programming.kicks-ass.net>


Hi, Peter,

I have a question regarding to one of your comments below.

On 3/12/18 3:01 PM, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote:
>> +static void stack_map_get_build_id_offset(struct bpf_map *map,
>> +					  struct stack_map_bucket *bucket,
>> +					  u64 *ips, u32 trace_nr)
>> +{
>> +	int i;
>> +	struct vm_area_struct *vma;
>> +	struct bpf_stack_build_id *id_offs;
>> +
>> +	bucket->nr = trace_nr;
>> +	id_offs = (struct bpf_stack_build_id *)bucket->data;
>> +
>> +	if (!current || !current->mm ||
>> +	    down_read_trylock(&current->mm->mmap_sem) == 0) {
> 
> You probably want an in_nmi() before the down_read_trylock(). Doing
> up_read() is an absolute no-no from NMI context.

The below is the final code from Song:

         /*
          * We cannot do up_read() in nmi context, so build_id lookup is
          * only supported for non-nmi events. If at some point, it is
          * possible to run find_vma() without taking the semaphore, we
          * would like to allow build_id lookup in nmi context.
          *
          * Same fallback is used for kernel stack (!user) on a stackmap
          * with build_id.
          */
         if (!user || !current || !current->mm || in_nmi() ||
             down_read_trylock(&current->mm->mmap_sem) == 0) {
                 /* cannot access current->mm, fall back to ips */
                 for (i = 0; i < trace_nr; i++) {
                         id_offs[i].status = BPF_STACK_BUILD_ID_IP;
                         id_offs[i].ip = ips[i];
                 }
                 return;
         }

         ....

> 
> And IIUC its 'trivial' to use this stuff with hardware counters.

Here, you mentioned that it was 'trivial' to use buildid thing with 
hardware counters, if I interpreted correctly. However, the hardware 
counter overflow will trigger NMI. Based on the above logic,
it will default to old IP only behavior.

Could you explain a little more how to get buildid with hardware
counter overflow events?

Thanks!

> 
>> +		/* cannot access current->mm, fall back to ips */
>> +		for (i = 0; i < trace_nr; i++) {
>> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> +			id_offs[i].ip = ips[i];
>> +		}
>> +		return;
>> +	}
>> +
>> +	for (i = 0; i < trace_nr; i++) {
>> +		vma = find_vma(current->mm, ips[i]);
>> +		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
>> +			/* per entry fall back to ips */
>> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> +			id_offs[i].ip = ips[i];
>> +			continue;
>> +		}
>> +		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
>> +			- vma->vm_start;
>> +		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>> +	}
>> +	up_read(&current->mm->mmap_sem);
>> +}
> 

^ permalink raw reply

* [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24  0:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180423151545.GU17484@dhcp22.suse.cz>

The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
verify the addresses passed to it, and so the user will get a reliable
stacktrace.

Some bugs (such as buffer overflows) are better detected
with kmalloc code, so we must test the kmalloc path too.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/util.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-23 00:12:05.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-23 17:57:02.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/random.h>
 
 #include <asm/sections.h>
 #include <linux/uaccess.h>
@@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_DEBUG_SG
+	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
+	if (!(prandom_u32_max(2) & 1))
+		goto do_vmalloc;
+#endif
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
+#ifdef CONFIG_DEBUG_SG
+do_vmalloc:
+#endif
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
 }

^ permalink raw reply

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Willem de Bruijn @ 2018-04-23 23:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180424021712-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, we add another setsockopt for registered user memory (umem)
>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>
>> The queue is used to explicitly pass ownership of umem frames from the
>> user process to the kernel. These frames will in a later patch be
>> filled in with Rx packet data by the kernel.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>  net/xdp/Makefile            |  2 +-
>>  net/xdp/xdp_umem.c          |  5 ++++
>>  net/xdp/xdp_umem.h          |  2 ++
>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 41252135a0fe..975661e1baca 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -23,6 +23,7 @@
>>
>>  /* XDP socket options */
>>  #define XDP_UMEM_REG                 3
>> +#define XDP_UMEM_FILL_RING           4
>>
>>  struct xdp_umem_reg {
>>       __u64 addr; /* Start of packet data area */
>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>       __u32 frame_headroom; /* Frame head room */
>>  };
>>
>> +/* Pgoff for mmaping the rings */
>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>> +
>> +struct xdp_ring {
>> +     __u32 producer __attribute__((aligned(64)));
>> +     __u32 consumer __attribute__((aligned(64)));
>> +};
>
> Why 64? And do you still need these guys in uapi?

I was just about to ask the same. You mean cacheline_aligned?

>> +static int xsk_mmap(struct file *file, struct socket *sock,
>> +                 struct vm_area_struct *vma)
>> +{
>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>> +     unsigned long size = vma->vm_end - vma->vm_start;
>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>> +     struct xsk_queue *q;
>> +     unsigned long pfn;
>> +     struct page *qpg;
>> +
>> +     if (!xs->umem)
>> +             return -EINVAL;
>> +
>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>> +             q = xs->umem->fq;
>> +     else
>> +             return -EINVAL;
>> +
>> +     qpg = virt_to_head_page(q->ring);

Is it assured that q is initialized with a call to setsockopt
XDP_UMEM_FILL_RING before the call the mmap?

In general, with such an extensive new API, it might be worthwhile to
run syzkaller locally on a kernel with these patches. It is pretty
easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
though it also needs to be taught about any new APIs.

^ permalink raw reply

* Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
From: Michael S. Tsirkin @ 2018-04-23 23:31 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, Björn Töpel
In-Reply-To: <20180423135619.7179-16-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> This is a sample application for AF_XDP sockets. The application
> supports three different modes of operation: rxdrop, txonly and l2fwd.
> 
> To show-case a simple round-robin load-balancing between a set of
> sockets in an xskmap, set the RR_LB compile time define option to 1 in
> "xdpsock.h".
> 
> Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  samples/bpf/Makefile       |   4 +
>  samples/bpf/xdpsock.h      |  11 +
>  samples/bpf/xdpsock_kern.c |  56 +++
>  samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 1018 insertions(+)
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index aa8c392e2e52..d0ddc1abf20d 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>  hostprogs-y += syscall_tp
>  hostprogs-y += cpustat
>  hostprogs-y += xdp_adjust_tail
> +hostprogs-y += xdpsock
>  
>  # Libbpf dependencies
>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>  xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>  always += syscall_tp_kern.o
>  always += cpustat_kern.o
>  always += xdp_adjust_tail_kern.o
> +always += xdpsock_kern.o
>  
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>  HOSTLOADLIBES_syscall_tp += -lelf
>  HOSTLOADLIBES_cpustat += -lelf
>  HOSTLOADLIBES_xdp_adjust_tail += -lelf
> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>  
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
> new file mode 100644
> index 000000000000..533ab81adfa1
> --- /dev/null
> +++ b/samples/bpf/xdpsock.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef XDPSOCK_H_
> +#define XDPSOCK_H_
> +
> +/* Power-of-2 number of sockets */
> +#define MAX_SOCKS 4
> +
> +/* Round-robin receive */
> +#define RR_LB 0
> +
> +#endif /* XDPSOCK_H_ */
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> new file mode 100644
> index 000000000000..d8806c41362e
> --- /dev/null
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#include "xdpsock.h"
> +
> +struct bpf_map_def SEC("maps") qidconf_map = {
> +	.type		= BPF_MAP_TYPE_ARRAY,
> +	.key_size	= sizeof(int),
> +	.value_size	= sizeof(int),
> +	.max_entries	= 1,
> +};
> +
> +struct bpf_map_def SEC("maps") xsks_map = {
> +	.type = BPF_MAP_TYPE_XSKMAP,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(int),
> +	.max_entries = 4,
> +};
> +
> +struct bpf_map_def SEC("maps") rr_map = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(unsigned int),
> +	.max_entries = 1,
> +};
> +
> +SEC("xdp_sock")
> +int xdp_sock_prog(struct xdp_md *ctx)
> +{
> +	int *qidconf, key = 0, idx;
> +	unsigned int *rr;
> +
> +	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
> +	if (!qidconf)
> +		return XDP_ABORTED;
> +
> +	if (*qidconf != ctx->rx_queue_index)
> +		return XDP_PASS;
> +
> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
> +	rr = bpf_map_lookup_elem(&rr_map, &key);
> +	if (!rr)
> +		return XDP_ABORTED;
> +
> +	*rr = (*rr + 1) & (MAX_SOCKS - 1);
> +	idx = *rr;
> +#else
> +	idx = 0;
> +#endif
> +
> +	return bpf_redirect_map(&xsks_map, idx, 0);
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> new file mode 100644
> index 000000000000..690bac1a0ab7
> --- /dev/null
> +++ b/samples/bpf/xdpsock_user.c
> @@ -0,0 +1,947 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2017 - 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <assert.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <net/if.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <net/ethernet.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +#include <locale.h>
> +#include <sys/types.h>
> +#include <poll.h>
> +
> +#include "bpf_load.h"
> +#include "bpf_util.h"
> +#include "libbpf.h"
> +
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif
> +
> +#define NUM_FRAMES 131072
> +#define FRAME_HEADROOM 0
> +#define FRAME_SIZE 2048
> +#define NUM_DESCS 1024
> +#define BATCH_SIZE 16
> +
> +#define FQ_NUM_DESCS 1024
> +#define CQ_NUM_DESCS 1024
> +
> +#define DEBUG_HEXDUMP 0
> +
> +typedef __u32 u32;
> +
> +static unsigned long prev_time;
> +
> +enum benchmark_type {
> +	BENCH_RXDROP = 0,
> +	BENCH_TXONLY = 1,
> +	BENCH_L2FWD = 2,
> +};
> +
> +static enum benchmark_type opt_bench = BENCH_RXDROP;
> +static u32 opt_xdp_flags;
> +static const char *opt_if = "";
> +static int opt_ifindex;
> +static int opt_queue;
> +static int opt_poll;
> +static int opt_shared_packet_buffer;
> +static int opt_interval = 1;
> +
> +struct xdp_umem_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_umem_ring *ring;
> +};
> +
> +struct xdp_umem {
> +	char (*frames)[FRAME_SIZE];
> +	struct xdp_umem_uqueue fq;
> +	struct xdp_umem_uqueue cq;
> +	int fd;
> +};
> +
> +struct xdp_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_rxtx_ring *ring;
> +};
> +
> +struct xdpsock {
> +	struct xdp_uqueue rx;
> +	struct xdp_uqueue tx;
> +	int sfd;
> +	struct xdp_umem *umem;
> +	u32 outstanding_tx;
> +	unsigned long rx_npkts;
> +	unsigned long tx_npkts;
> +	unsigned long prev_rx_npkts;
> +	unsigned long prev_tx_npkts;
> +};
> +
> +#define MAX_SOCKS 4
> +static int num_socks;
> +struct xdpsock *xsks[MAX_SOCKS];
> +
> +static unsigned long get_nsecs(void)
> +{
> +	struct timespec ts;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &ts);
> +	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
> +}
> +
> +static void dump_stats(void);
> +
> +#define lassert(expr)							\
> +	do {								\
> +		if (!(expr)) {						\
> +			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
> +				#expr ": errno: %d/\"%s\"\n",		\
> +				__FILE__, __func__, __LINE__,		\
> +				errno, strerror(errno));		\
> +			dump_stats();					\
> +			exit(EXIT_FAILURE);				\
> +		}							\
> +	} while (0)
> +
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +#define u_smp_rmb() barrier()
> +#define u_smp_wmb() barrier()
> +#define likely(x) __builtin_expect(!!(x), 1)
> +#define unlikely(x) __builtin_expect(!!(x), 0)
> +
> +static const char pkt_data[] =
> +	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
> +	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
> +	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
> +	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
> +
> +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
> +
> +	if (free_entries >= nb)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer;
> +
> +	return q->size - (q->cached_prod - q->cached_cons);
> +}
> +
> +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 free_entries = q->cached_cons - q->cached_prod;
> +
> +	if (free_entries >= ndescs)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer + q->size;
> +	return q->cached_cons - q->cached_prod;
> +}
> +
> +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +
> +	return (entries > nb) ? nb : entries;
> +}
> +
> +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +	return (entries > ndescs) ? ndescs : entries;
> +}
> +
> +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
> +					 struct xdp_desc *d,
> +					 size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i].idx;
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
> +				      size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i];
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
> +					       u32 *d, size_t nb)
> +{
> +	u32 idx, i, entries = umem_nb_avail(cq, nb);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = cq->cached_cons++ & cq->mask;
> +		d[i] = cq->ring->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		cq->ring->ptrs.consumer = cq->cached_cons;
> +	}
> +
> +	return entries;
> +}
> +
> +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
> +{
> +	lassert(idx < NUM_FRAMES);
> +	return &xsk->umem->frames[idx][off];
> +}
> +
> +static inline int xq_enq(struct xdp_uqueue *uq,
> +			 const struct xdp_desc *descs,
> +			 unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		r->desc[idx].idx = descs[i].idx;
> +		r->desc[idx].len = descs[i].len;
> +		r->desc[idx].offset = descs[i].offset;
> +	}
> +
> +	u_smp_wmb();
> +
> +	r->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
> +				 __u32 idx, unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *q = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		q->desc[idx].idx	= idx + i;
> +		q->desc[idx].len	= sizeof(pkt_data) - 1;
> +		q->desc[idx].offset	= 0;
> +	}
> +
> +	u_smp_wmb();
> +
> +	q->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_deq(struct xdp_uqueue *uq,
> +			 struct xdp_desc *descs,
> +			 int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int idx;
> +	int i, entries;
> +
> +	entries = xq_nb_avail(uq, ndescs);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = uq->cached_cons++ & uq->mask;
> +		descs[i] = r->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		r->ptrs.consumer = uq->cached_cons;
> +	}
> +
> +	return entries;
> +}

Interesting, I was under the impression that you were
planning to get rid of consumer/producer counters
and validate the descriptors instead.

That's the ptr_ring design.

You can then drop all the code around synchronising
counter caches, as well as smp_rmb barriers.


> +
> +static void swap_mac_addresses(void *data)
> +{
> +	struct ether_header *eth = (struct ether_header *)data;
> +	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
> +	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
> +	struct ether_addr tmp;
> +
> +	tmp = *src_addr;
> +	*src_addr = *dst_addr;
> +	*dst_addr = tmp;
> +}
> +
> +#if DEBUG_HEXDUMP
> +static void hex_dump(void *pkt, size_t length, const char *prefix)
> +{
> +	int i = 0;
> +	const unsigned char *address = (unsigned char *)pkt;
> +	const unsigned char *line = address;
> +	size_t line_size = 32;
> +	unsigned char c;
> +
> +	printf("length = %zu\n", length);
> +	printf("%s | ", prefix);
> +	while (length-- > 0) {
> +		printf("%02X ", *address++);
> +		if (!(++i % line_size) || (length == 0 && i % line_size)) {
> +			if (length == 0) {
> +				while (i++ % line_size)
> +					printf("__ ");
> +			}
> +			printf(" | ");	/* right close */
> +			while (line < address) {
> +				c = *line++;
> +				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
> +			}
> +			printf("\n");
> +			if (length > 0)
> +				printf("%s | ", prefix);
> +		}
> +	}
> +	printf("\n");
> +}
> +#endif
> +
> +static size_t gen_eth_frame(char *frame)
> +{
> +	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
> +	return sizeof(pkt_data) - 1;
> +}
> +
> +static struct xdp_umem *xdp_umem_configure(int sfd)
> +{
> +	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
> +	struct xdp_umem_reg mr;
> +	struct xdp_umem *umem;
> +	void *bufs;
> +
> +	umem = calloc(1, sizeof(*umem));
> +	lassert(umem);
> +
> +	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
> +			       NUM_FRAMES * FRAME_SIZE) == 0);
> +
> +	mr.addr = (__u64)bufs;
> +	mr.len = NUM_FRAMES * FRAME_SIZE;
> +	mr.frame_size = FRAME_SIZE;
> +	mr.frame_headroom = FRAME_HEADROOM;
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
> +			   sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
> +			   sizeof(int)) == 0);
> +
> +	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     FQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_FILL_RING);
> +	lassert(umem->fq.ring != MAP_FAILED);
> +
> +	umem->fq.mask = FQ_NUM_DESCS - 1;
> +	umem->fq.size = FQ_NUM_DESCS;
> +
> +	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     CQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_COMPLETION_RING);
> +	lassert(umem->cq.ring != MAP_FAILED);
> +
> +	umem->cq.mask = CQ_NUM_DESCS - 1;
> +	umem->cq.size = CQ_NUM_DESCS;
> +
> +	umem->frames = (char (*)[FRAME_SIZE])bufs;
> +	umem->fd = sfd;
> +
> +	if (opt_bench == BENCH_TXONLY) {
> +		int i;
> +
> +		for (i = 0; i < NUM_FRAMES; i++)
> +			(void)gen_eth_frame(&umem->frames[i][0]);
> +	}
> +
> +	return umem;
> +}
> +
> +static struct xdpsock *xsk_configure(struct xdp_umem *umem)
> +{
> +	struct sockaddr_xdp sxdp = {};
> +	int sfd, ndescs = NUM_DESCS;
> +	struct xdpsock *xsk;
> +	bool shared = true;
> +	u32 i;
> +
> +	sfd = socket(PF_XDP, SOCK_RAW, 0);
> +	lassert(sfd >= 0);
> +
> +	xsk = calloc(1, sizeof(*xsk));
> +	lassert(xsk);
> +
> +	xsk->sfd = sfd;
> +	xsk->outstanding_tx = 0;
> +
> +	if (!umem) {
> +		shared = false;
> +		xsk->umem = xdp_umem_configure(sfd);
> +	} else {
> +		xsk->umem = umem;
> +	}
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +
> +	/* Rx */
> +	xsk->rx.ring = mmap(NULL,
> +			    sizeof(struct xdp_ring) +
> +			    NUM_DESCS * sizeof(struct xdp_desc),
> +			    PROT_READ | PROT_WRITE,
> +			    MAP_SHARED | MAP_POPULATE, sfd,
> +			    XDP_PGOFF_RX_RING);
> +	lassert(xsk->rx.ring != MAP_FAILED);
> +
> +	if (!shared) {
> +		for (i = 0; i < NUM_DESCS / 2; i++)
> +			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
> +				== 0);
> +	}
> +
> +	/* Tx */
> +	xsk->tx.ring = mmap(NULL,
> +			 sizeof(struct xdp_ring) +
> +			 NUM_DESCS * sizeof(struct xdp_desc),
> +			 PROT_READ | PROT_WRITE,
> +			 MAP_SHARED | MAP_POPULATE, sfd,
> +			 XDP_PGOFF_TX_RING);
> +	lassert(xsk->tx.ring != MAP_FAILED);
> +
> +	xsk->rx.mask = NUM_DESCS - 1;
> +	xsk->rx.size = NUM_DESCS;
> +
> +	xsk->tx.mask = NUM_DESCS - 1;
> +	xsk->tx.size = NUM_DESCS;
> +
> +	sxdp.sxdp_family = PF_XDP;
> +	sxdp.sxdp_ifindex = opt_ifindex;
> +	sxdp.sxdp_queue_id = opt_queue;
> +	if (shared) {
> +		sxdp.sxdp_flags = XDP_SHARED_UMEM;
> +		sxdp.sxdp_shared_umem_fd = umem->fd;
> +	}
> +
> +	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
> +
> +	return xsk;
> +}
> +
> +static void print_benchmark(bool running)
> +{
> +	const char *bench_str = "INVALID";
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		bench_str = "rxdrop";
> +	else if (opt_bench == BENCH_TXONLY)
> +		bench_str = "txonly";
> +	else if (opt_bench == BENCH_L2FWD)
> +		bench_str = "l2fwd";
> +
> +	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
> +	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
> +		printf("xdp-skb ");
> +	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
> +		printf("xdp-drv ");
> +	else
> +		printf("	");
> +
> +	if (opt_poll)
> +		printf("poll() ");
> +
> +	if (running) {
> +		printf("running...");
> +		fflush(stdout);
> +	}
> +}
> +
> +static void dump_stats(void)
> +{
> +	unsigned long now = get_nsecs();
> +	long dt = now - prev_time;
> +	int i;
> +
> +	prev_time = now;
> +
> +	for (i = 0; i < num_socks; i++) {
> +		char *fmt = "%-15s %'-11.0f %'-11lu\n";
> +		double rx_pps, tx_pps;
> +
> +		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
> +			 1000000000. / dt;
> +		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
> +			 1000000000. / dt;
> +
> +		printf("\n sock%d@", i);
> +		print_benchmark(false);
> +		printf("\n");
> +
> +		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> +		       dt / 1000000000.);
> +		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> +		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> +
> +		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
> +		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
> +	}
> +}
> +
> +static void *poller(void *arg)
> +{
> +	(void)arg;
> +	for (;;) {
> +		sleep(opt_interval);
> +		dump_stats();
> +	}
> +
> +	return NULL;
> +}
> +
> +static void int_exit(int sig)
> +{
> +	(void)sig;
> +	dump_stats();
> +	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> +	exit(EXIT_SUCCESS);
> +}
> +
> +static struct option long_options[] = {
> +	{"rxdrop", no_argument, 0, 'r'},
> +	{"txonly", no_argument, 0, 't'},
> +	{"l2fwd", no_argument, 0, 'l'},
> +	{"interface", required_argument, 0, 'i'},
> +	{"queue", required_argument, 0, 'q'},
> +	{"poll", no_argument, 0, 'p'},
> +	{"shared-buffer", no_argument, 0, 's'},
> +	{"xdp-skb", no_argument, 0, 'S'},
> +	{"xdp-native", no_argument, 0, 'N'},
> +	{"interval", required_argument, 0, 'n'},
> +	{0, 0, 0, 0}
> +};
> +
> +static void usage(const char *prog)
> +{
> +	const char *str =
> +		"  Usage: %s [OPTIONS]\n"
> +		"  Options:\n"
> +		"  -r, --rxdrop		Discard all incoming packets (default)\n"
> +		"  -t, --txonly		Only send packets\n"
> +		"  -l, --l2fwd		MAC swap L2 forwarding\n"
> +		"  -i, --interface=n	Run on interface n\n"
> +		"  -q, --queue=n	Use queue n (default 0)\n"
> +		"  -p, --poll		Use poll syscall\n"
> +		"  -s, --shared-buffer	Use shared packet buffer\n"
> +		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
> +		"  -N, --xdp-native=n	Enfore XDP native mode\n"
> +		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
> +		"\n";
> +	fprintf(stderr, str, prog);
> +	exit(EXIT_FAILURE);
> +}
> +
> +static void parse_command_line(int argc, char **argv)
> +{
> +	int option_index, c;
> +
> +	opterr = 0;
> +
> +	for (;;) {
> +		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
> +				&option_index);
> +		if (c == -1)
> +			break;
> +
> +		switch (c) {
> +		case 'r':
> +			opt_bench = BENCH_RXDROP;
> +			break;
> +		case 't':
> +			opt_bench = BENCH_TXONLY;
> +			break;
> +		case 'l':
> +			opt_bench = BENCH_L2FWD;
> +			break;
> +		case 'i':
> +			opt_if = optarg;
> +			break;
> +		case 'q':
> +			opt_queue = atoi(optarg);
> +			break;
> +		case 's':
> +			opt_shared_packet_buffer = 1;
> +			break;
> +		case 'p':
> +			opt_poll = 1;
> +			break;
> +		case 'S':
> +			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
> +			break;
> +		case 'N':
> +			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
> +			break;
> +		case 'n':
> +			opt_interval = atoi(optarg);
> +			break;
> +		default:
> +			usage(basename(argv[0]));
> +		}
> +	}
> +
> +	opt_ifindex = if_nametoindex(opt_if);
> +	if (!opt_ifindex) {
> +		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
> +			opt_if);
> +		usage(basename(argv[0]));
> +	}
> +}
> +
> +static void kick_tx(int fd)
> +{
> +	int ret;
> +
> +	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
> +		return;
> +	lassert(0);
> +}
> +
> +static inline void complete_tx_l2fwd(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +	size_t ndescs;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
> +		 xsk->outstanding_tx;
> +
> +	/* re-add completed Tx buffers */
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
> +	if (rcvd > 0) {
> +		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static inline void complete_tx_only(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
> +	if (rcvd > 0) {
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static void rx_drop(struct xdpsock *xsk)
> +{
> +	struct xdp_desc descs[BATCH_SIZE];
> +	unsigned int rcvd, i;
> +
> +	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +	if (!rcvd)
> +		return;
> +
> +	for (i = 0; i < rcvd; i++) {
> +		u32 idx = descs[i].idx;
> +
> +		lassert(idx < NUM_FRAMES);
> +#if DEBUG_HEXDUMP
> +		char *pkt;
> +		char buf[32];
> +
> +		pkt = xq_get_data(xsk, idx, descs[i].offset);
> +		sprintf(buf, "idx=%d", idx);
> +		hex_dump(pkt, descs[i].len, buf);
> +#endif
> +	}
> +
> +	xsk->rx_npkts += rcvd;
> +
> +	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
> +}
> +
> +static void rx_drop_all(void)
> +{
> +	struct pollfd fds[MAX_SOCKS + 1];
> +	int i, ret, timeout, nfds = 1;
> +
> +	memset(fds, 0, sizeof(fds));
> +
> +	for (i = 0; i < num_socks; i++) {
> +		fds[i].fd = xsks[i]->sfd;
> +		fds[i].events = POLLIN;
> +		timeout = 1000; /* 1sn */
> +	}
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +		}
> +
> +		for (i = 0; i < num_socks; i++)
> +			rx_drop(xsks[i]);
> +	}
> +}
> +
> +static void tx_only(struct xdpsock *xsk)
> +{
> +	int timeout, ret, nfds = 1;
> +	struct pollfd fds[nfds + 1];
> +	unsigned int idx = 0;
> +
> +	memset(fds, 0, sizeof(fds));
> +	fds[0].fd = xsk->sfd;
> +	fds[0].events = POLLOUT;
> +	timeout = 1000; /* 1sn */
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +
> +			if (fds[0].fd != xsk->sfd ||
> +			    !(fds[0].revents & POLLOUT))
> +				continue;
> +		}
> +
> +		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
> +			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
> +
> +			xsk->outstanding_tx += BATCH_SIZE;
> +			idx += BATCH_SIZE;
> +			idx %= NUM_FRAMES;
> +		}
> +
> +		complete_tx_only(xsk);
> +	}
> +}
> +
> +static void l2fwd(struct xdpsock *xsk)
> +{
> +	for (;;) {
> +		struct xdp_desc descs[BATCH_SIZE];
> +		unsigned int rcvd, i;
> +		int ret;
> +
> +		for (;;) {
> +			complete_tx_l2fwd(xsk);
> +
> +			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +			if (rcvd > 0)
> +				break;
> +		}
> +
> +		for (i = 0; i < rcvd; i++) {
> +			char *pkt = xq_get_data(xsk, descs[i].idx,
> +						descs[i].offset);
> +
> +			swap_mac_addresses(pkt);
> +#if DEBUG_HEXDUMP
> +			char buf[32];
> +			u32 idx = descs[i].idx;
> +
> +			sprintf(buf, "idx=%d", idx);
> +			hex_dump(pkt, descs[i].len, buf);
> +#endif
> +		}
> +
> +		xsk->rx_npkts += rcvd;
> +
> +		ret = xq_enq(&xsk->tx, descs, rcvd);
> +		lassert(ret == 0);
> +		xsk->outstanding_tx += rcvd;
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +	char xdp_filename[256];
> +	int i, ret, key = 0;
> +	pthread_t pt;
> +
> +	parse_command_line(argc, argv);
> +
> +	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
> +
> +	if (load_bpf_file(xdp_filename)) {
> +		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (!prog_fd[0]) {
> +		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
> +		fprintf(stderr, "ERROR: link set xdp fd failed\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
> +	if (ret) {
> +		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Create sockets... */
> +	xsks[num_socks++] = xsk_configure(NULL);
> +
> +#if RR_LB
> +	for (i = 0; i < MAX_SOCKS - 1; i++)
> +		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
> +#endif
> +
> +	/* ...and insert them into the map. */
> +	for (i = 0; i < num_socks; i++) {
> +		key = i;
> +		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
> +		if (ret) {
> +			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	signal(SIGINT, int_exit);
> +	signal(SIGTERM, int_exit);
> +	signal(SIGABRT, int_exit);
> +
> +	setlocale(LC_ALL, "");
> +
> +	ret = pthread_create(&pt, NULL, poller, NULL);
> +	lassert(ret == 0);
> +
> +	prev_time = get_nsecs();
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		rx_drop_all();
> +	else if (opt_bench == BENCH_TXONLY)
> +		tx_only(xsks[0]);
> +	else
> +		l2fwd(xsks[0]);
> +
> +	return 0;
> +}
> -- 
> 2.14.1

^ permalink raw reply

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Michael S. Tsirkin @ 2018-04-23 23:22 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180423135619.7179-1-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
> 
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
> 
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
> 
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
> 
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
> 
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
> 
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. A fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
> 
> There is a xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
> for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
> 
> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.9(3.0)   9.4(9.3)  
> txpush       2.5(2.2)   NA*
> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.1(2.2)   3.3(3.1)  
> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
> 
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
> 
> Changes from RFC V2:
> 
> * Optimizations and simplifications to the ring structures inspired by
>   ptr_ring.h 
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
> 
> The structure of the patch set is as follows:
> 
> Patches 1-2: Basic socket and umem plumbing 
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
> 
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
> 
> Questions:
> 
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
> 
> To do:
> 
> * Optimize performance
> 
> * Kernel selftest
> 
> Post-series plan:
> 
> * Kernel load module support of AF_XDP would be nice. Unclear how to
>   achieve this though since our XDP code depends on net/core.
> 
> * Support for AF_XDP sockets without an XPD program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
> 
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
> 
> * And of course getting to zero-copy support in small increments. 
> 
> Thanks: Björn and Magnus
> 
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
> 
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
> 
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c

Is there a chance of Documentation/networking/af_xdp.txt ?


> 
> -- 
> 2.14.1

^ permalink raw reply

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Michael S. Tsirkin @ 2018-04-23 23:21 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang
In-Reply-To: <20180423135619.7179-4-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>  net/xdp/Makefile            |  2 +-
>  net/xdp/xdp_umem.c          |  5 ++++
>  net/xdp/xdp_umem.h          |  2 ++
>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG			3
> +#define XDP_UMEM_FILL_RING		4
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
> +
> +struct xdp_ring {
> +	__u32 producer __attribute__((aligned(64)));
> +	__u32 consumer __attribute__((aligned(64)));
> +};

Why 64? And do you still need these guys in uapi?

> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> +	struct xdp_ring ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  	struct mm_struct *mm;
>  	unsigned long diff;
>  
> +	if (umem->fq) {
> +		xskq_destroy(umem->fq);
> +		umem->fq = NULL;
> +	}
> +
>  	if (umem->pgs) {
>  		xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include <linux/mm.h>
>  #include <linux/if_xdp.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> +	struct xsk_queue *fq;
>  	struct page **pgs;
>  	struct xdp_umem_props props;
>  	u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include <linux/netdevice.h>
>  #include <net/sock.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> +	struct xsk_queue *q;
> +
> +	if (entries == 0 || *queue || !is_power_of_2(entries))
> +		return -EINVAL;
> +
> +	q = xskq_create(entries);
> +	if (!q)
> +		return -ENOMEM;
> +
> +	*queue = q;
> +	return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>  	struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return 0;
>  	}
> +	case XDP_UMEM_FILL_RING:
> +	{
> +		struct xsk_queue **q;
> +		int entries;
> +
> +		if (!xs->umem)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&entries, optval, sizeof(entries)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		q = &xs->umem->fq;
> +		err = xsk_init_queue(entries, q);
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>  	default:
>  		break;
>  	}
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  	return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> +		    struct vm_area_struct *vma)
> +{
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct xsk_queue *q;
> +	unsigned long pfn;
> +	struct page *qpg;
> +
> +	if (!xs->umem)
> +		return -EINVAL;
> +
> +	if (offset == XDP_UMEM_PGOFF_FILL_RING)
> +		q = xs->umem->fq;
> +	else
> +		return -EINVAL;
> +
> +	qpg = virt_to_head_page(q->ring);
> +	if (size > (PAGE_SIZE << compound_order(qpg)))
> +		return -EINVAL;
> +
> +	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
> +	return remap_pfn_range(vma, vma->vm_start, pfn,
> +			       size, vma->vm_page_prot);
> +}
> +
>  static struct proto xsk_proto = {
>  	.name =		"XDP",
>  	.owner =	THIS_MODULE,
> @@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
>  	.getsockopt =	sock_no_getsockopt,
>  	.sendmsg =	sock_no_sendmsg,
>  	.recvmsg =	sock_no_recvmsg,
> -	.mmap =		sock_no_mmap,
> +	.mmap =		xsk_mmap,
>  	.sendpage =	sock_no_sendpage,
>  };
>  
> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
> new file mode 100644
> index 000000000000..23da4f29d3fb
> --- /dev/null
> +++ b/net/xdp/xsk_queue.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "xsk_queue.h"
> +
> +static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
> +{
> +	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
> +}
> +
> +struct xsk_queue *xskq_create(u32 nentries)
> +{
> +	struct xsk_queue *q;
> +	gfp_t gfp_flags;
> +	size_t size;
> +
> +	q = kzalloc(sizeof(*q), GFP_KERNEL);
> +	if (!q)
> +		return NULL;
> +
> +	q->nentries = nentries;
> +	q->ring_mask = nentries - 1;
> +
> +	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
> +		    __GFP_COMP  | __GFP_NORETRY;
> +	size = xskq_umem_get_ring_size(q);
> +
> +	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
> +						      get_order(size));
> +	if (!q->ring) {
> +		kfree(q);
> +		return NULL;
> +	}
> +
> +	return q;
> +}
> +
> +void xskq_destroy(struct xsk_queue *q)
> +{
> +	if (!q)
> +		return;
> +
> +	page_frag_free(q->ring);
> +	kfree(q);
> +}
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> new file mode 100644
> index 000000000000..7eb556bf73be
> --- /dev/null
> +++ b/net/xdp/xsk_queue.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _LINUX_XSK_QUEUE_H
> +#define _LINUX_XSK_QUEUE_H
> +
> +#include <linux/types.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xsk_queue {
> +	struct xdp_umem_props umem_props;
> +	u32 ring_mask;
> +	u32 nentries;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_head;
> +	u32 cons_tail;
> +	struct xdp_ring *ring;
> +	u64 invalid_descs;
> +};
> +
> +struct xsk_queue *xskq_create(u32 nentries);
> +void xskq_destroy(struct xsk_queue *q);
> +
> +#endif /* _LINUX_XSK_QUEUE_H */
> -- 
> 2.14.1

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-23 23:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, bhutchings, Andrew Morton,
	David Miller, Vlastimil Babka
In-Reply-To: <20180423151015.GT17484@dhcp22.suse.cz>



On Mon, 23 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
> 
> > > Really, we have a fault injection framework and this sounds like
> > > something to hook in there.
> > 
> > The testing people won't set it up. They install the "kernel-debug" 
> > package and run the tests in it.
> > 
> > If you introduce a hidden option that no one knows about, no one will use 
> > it.
> 
> then make sure people know about it. Fuzzers already do test fault
> injections.

I think that in the long term we can introduce a kernel parameter like 
"debug_level=1", "debug_level=2", "debug_level=3" that will turn on 
debugging features across all kernel subsystems and we can teach users to 
use it to diagnose problems.

But it won't work if every subsystem has different debug parameters. There 
are 192 distinct filenames in debugfs, if we add 193rd one, harly anyone 
notices it.

Mikulas

^ permalink raw reply

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Michael S. Tsirkin @ 2018-04-23 23:16 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang
In-Reply-To: <20180423135619.7179-4-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>  net/xdp/Makefile            |  2 +-
>  net/xdp/xdp_umem.c          |  5 ++++
>  net/xdp/xdp_umem.h          |  2 ++
>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG			3
> +#define XDP_UMEM_FILL_RING		4
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
> +
> +struct xdp_ring {
> +	__u32 producer __attribute__((aligned(64)));
> +	__u32 consumer __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> +	struct xdp_ring ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  	struct mm_struct *mm;
>  	unsigned long diff;
>  
> +	if (umem->fq) {
> +		xskq_destroy(umem->fq);
> +		umem->fq = NULL;
> +	}
> +
>  	if (umem->pgs) {
>  		xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include <linux/mm.h>
>  #include <linux/if_xdp.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> +	struct xsk_queue *fq;
>  	struct page **pgs;
>  	struct xdp_umem_props props;
>  	u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include <linux/netdevice.h>
>  #include <net/sock.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> +	struct xsk_queue *q;
> +
> +	if (entries == 0 || *queue || !is_power_of_2(entries))
> +		return -EINVAL;
> +
> +	q = xskq_create(entries);
> +	if (!q)
> +		return -ENOMEM;
> +
> +	*queue = q;
> +	return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>  	struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return 0;
>  	}
> +	case XDP_UMEM_FILL_RING:
> +	{
> +		struct xsk_queue **q;
> +		int entries;
> +
> +		if (!xs->umem)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&entries, optval, sizeof(entries)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		q = &xs->umem->fq;
> +		err = xsk_init_queue(entries, q);
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>  	default:
>  		break;
>  	}
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  	return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> +		    struct vm_area_struct *vma)
> +{
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct xsk_queue *q;
> +	unsigned long pfn;
> +	struct page *qpg;
> +
> +	if (!xs->umem)
> +		return -EINVAL;
> +
> +	if (offset == XDP_UMEM_PGOFF_FILL_RING)
> +		q = xs->umem->fq;
> +	else
> +		return -EINVAL;
> +
> +	qpg = virt_to_head_page(q->ring);
> +	if (size > (PAGE_SIZE << compound_order(qpg)))
> +		return -EINVAL;
> +
> +	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
> +	return remap_pfn_range(vma, vma->vm_start, pfn,
> +			       size, vma->vm_page_prot);
> +}
> +
>  static struct proto xsk_proto = {
>  	.name =		"XDP",
>  	.owner =	THIS_MODULE,
> @@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
>  	.getsockopt =	sock_no_getsockopt,
>  	.sendmsg =	sock_no_sendmsg,
>  	.recvmsg =	sock_no_recvmsg,
> -	.mmap =		sock_no_mmap,
> +	.mmap =		xsk_mmap,
>  	.sendpage =	sock_no_sendpage,
>  };
>  
> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
> new file mode 100644
> index 000000000000..23da4f29d3fb
> --- /dev/null
> +++ b/net/xdp/xsk_queue.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "xsk_queue.h"
> +
> +static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
> +{
> +	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
> +}
> +
> +struct xsk_queue *xskq_create(u32 nentries)
> +{
> +	struct xsk_queue *q;
> +	gfp_t gfp_flags;
> +	size_t size;
> +
> +	q = kzalloc(sizeof(*q), GFP_KERNEL);
> +	if (!q)
> +		return NULL;
> +
> +	q->nentries = nentries;
> +	q->ring_mask = nentries - 1;
> +
> +	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
> +		    __GFP_COMP  | __GFP_NORETRY;
> +	size = xskq_umem_get_ring_size(q);
> +
> +	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
> +						      get_order(size));
> +	if (!q->ring) {
> +		kfree(q);
> +		return NULL;
> +	}
> +
> +	return q;
> +}
> +
> +void xskq_destroy(struct xsk_queue *q)
> +{
> +	if (!q)
> +		return;
> +
> +	page_frag_free(q->ring);
> +	kfree(q);
> +}
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> new file mode 100644
> index 000000000000..7eb556bf73be
> --- /dev/null
> +++ b/net/xdp/xsk_queue.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _LINUX_XSK_QUEUE_H
> +#define _LINUX_XSK_QUEUE_H
> +
> +#include <linux/types.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xsk_queue {
> +	struct xdp_umem_props umem_props;
> +	u32 ring_mask;
> +	u32 nentries;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_head;
> +	u32 cons_tail;
> +	struct xdp_ring *ring;
> +	u64 invalid_descs;
> +};

Any documentation on how e.g. the locking works here?


> +
> +struct xsk_queue *xskq_create(u32 nentries);
> +void xskq_destroy(struct xsk_queue *q);
> +
> +#endif /* _LINUX_XSK_QUEUE_H */
> -- 
> 2.14.1

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Paul Moore @ 2018-04-23 23:15 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180421143443.faaput5g2rn6ul7p@madcap2.tricolour.ca>

On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-18 19:47, Paul Moore wrote:
>> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > Implement the proc fs write to set the audit container ID of a process,
>> > emitting an AUDIT_CONTAINER record to document the event.
>> >
>> > This is a write from the container orchestrator task to a proc entry of
>> > the form /proc/PID/containerid where PID is the process ID of the newly
>> > created task that is to become the first task in a container, or an
>> > additional task added to a container.
>> >
>> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >
>> > This will produce a record such as this:
>> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >
>> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> > the orchestrator while the "opid" field is the object's PID, the process
>> > being "contained".  Old and new container ID values are given in the
>> > "contid" fields, while res indicates its success.
>> >
>> > It is not permitted to self-set, unset or re-set the container ID.  A
>> > child inherits its parent's container ID, but then can be set only once
>> > after.
>> >
>> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >
>> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> > ---
>> >  fs/proc/base.c             | 37 ++++++++++++++++++++
>> >  include/linux/audit.h      | 16 +++++++++
>> >  include/linux/init_task.h  |  4 ++-
>> >  include/linux/sched.h      |  1 +
>> >  include/uapi/linux/audit.h |  2 ++
>> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> >  6 files changed, 143 insertions(+), 1 deletion(-)

...

>> > diff --git a/include/linux/sched.h b/include/linux/sched.h
>> > index d258826..1b82191 100644
>> > --- a/include/linux/sched.h
>> > +++ b/include/linux/sched.h
>> > @@ -796,6 +796,7 @@ struct task_struct {
>> >  #ifdef CONFIG_AUDITSYSCALL
>> >         kuid_t                          loginuid;
>> >         unsigned int                    sessionid;
>> > +       u64                             containerid;
>>
>> This one line addition to the task_struct scares me the most of
>> anything in this patchset.  Why?  It's a field named "containerid" in
>> a perhaps one of the most widely used core kernel structures; the
>> possibilities for abuse are endless, and it's foolish to think we
>> would ever be able to adequately police this.
>
> Fair enough.
>
>> Unfortunately, we can't add the field to audit_context as things
>> currently stand because we don't always allocate an audit_context,
>> it's dependent on the system's configuration, and we need to track the
>> audit container ID for a given process, regardless of the audit
>> configuration.  Pretty much the same reason why loginuid and sessionid
>> are located directly in task_struct now.  As I stressed during the
>> design phase, I really want to keep this as an *audit* container ID
>> and not a general purpose kernel wide container ID.  If the kernel
>> ever grows a general purpose container ID token, I'll be the first in
>> line to convert the audit code, but I don't want audit to be that
>> general purpose mechanism ... audit is hated enough as-is ;)
>
> When would we need an audit container ID when audit is not enabled
> enough to have an audit_context?

I'm thinking of the audit_alloc() case where audit_filter_task()
returns AUDIT_DISABLED.

I believe this is the same reason why loginuid and sessionid live
directly in the task_struct and not in the audit_context; they need to
persist for the lifetime of the task.

> If it is only used for audit, and audit is the only consumer, and audit
> can only use it when it is enabled, then we can just return success to
> any write to the proc filehandle, or not even present it.  Nothing will
> be able to know that value wasn't used.
>
> When are loginuid and sessionid used now when audit is not enabled (or
> should I say, explicitly disabled)?

See above.  I think that should answer these questions.

>> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
>> > index 4e61a9e..921a71f 100644
>> > --- a/include/uapi/linux/audit.h
>> > +++ b/include/uapi/linux/audit.h
>> > @@ -71,6 +71,7 @@
>> >  #define AUDIT_TTY_SET          1017    /* Set TTY auditing status */
>> >  #define AUDIT_SET_FEATURE      1018    /* Turn an audit feature on or off */
>> >  #define AUDIT_GET_FEATURE      1019    /* Get which features are enabled */
>> > +#define AUDIT_CONTAINER                1020    /* Define the container id and information */
>> >
>> >  #define AUDIT_FIRST_USER_MSG   1100    /* Userspace messages mostly uninteresting to kernel */
>> >  #define AUDIT_USER_AVC         1107    /* We filter this differently */
>> > @@ -465,6 +466,7 @@ struct audit_tty_status {
>> >  };
>> >
>> >  #define AUDIT_UID_UNSET (unsigned int)-1
>> > +#define AUDIT_CID_UNSET ((u64)-1)
>>
>> I think we need to decide if we want to distinguish between the "host"
>> (e.g. init ns) and "unset".  Looking at this patch (I've only quickly
>> skimmed the others so far) it would appear that you don't think we
>> need to worry about this distinction; that's fine, but let's make it
>> explicit with a comment in the code that AUDIT_CID_UNSET means "unset"
>> as well as "host".
>
> I don't see any reason to distinguish between "host" and "unset".  Since
> a container doesn't have a concrete definition based in namespaces, the
> initial namespace set is meaningless here.

Okay, that sounds reasonable.

> Is there value in having a container orchestrator process have a
> reserved container ID that has a policy distinct from any other
> container?

I'm open to arguments for this idea, but I don't see a point to it right now.

> If so, then I could see the value in making the distinction.
> For example, I've heard of interest in systemd acting as a container
> orchestrator, so if it took on that role as PID 1, then every process in
> the system would inherit that ID and none would be unset.
>
> I can't picture how having seperate "host" and "unset" values helps us.

I don't have a strong feeling either way, I just wanted to ask the question.

>> >  /* audit_rule_data supports filter rules with both integer and string
>> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> > index 4e0a4ac..29c8482 100644
>> > --- a/kernel/auditsc.c
>> > +++ b/kernel/auditsc.c
>> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >         return rc;
>> >  }
>> >
>> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>> > +{
>> > +       struct task_struct *parent;
>> > +       u64 pcontainerid, ccontainerid;
>> > +
>> > +       /* Don't allow to set our own containerid */
>> > +       if (current == task)
>> > +               return -EPERM;
>>
>> Why not?  Is there some obvious security concern that I missing?
>
> We then lose the distinction in the AUDIT_CONTAINER record between the
> initiating PID and the target PID.  This was outlined in the proposal.

I just went back and reread the v3 proposal and I still don't see a
good explanation of this.  Why is this bad?  What's the security
concern?

> Having said that, I'm still not sure we have protected sufficiently from
> a child turning around and setting it's parent's as yet unset or
> inherited audit container ID.

Yes, I believe we only want to let a task set the audit container for
it's children (or itself/threads if we decide to allow that, see
above).  There *has* to be a function to check to see if a task if a
child of a given task ... right? ... although this is likely to be a
pointer traversal and locking nightmare ... hmmm.

>> I ask because I suppose it might be possible for some container
>> runtime to do a fork, setup some of the environment and them exec the
>> container (before you answer the obvious "namespaces!" please remember
>> we're not trying to define containers).
>
> I don't think namespaces have any bearing on this concern since none are
> required.
>
>> > +       /* Don't allow the containerid to be unset */
>> > +       if (!cid_valid(containerid))
>> > +               return -EINVAL;
>> > +       /* if we don't have caps, reject */
>> > +       if (!capable(CAP_AUDIT_CONTROL))
>> > +               return -EPERM;
>> > +       /* if containerid is unset, allow */
>> > +       if (!audit_containerid_set(task))
>> > +               return 0;
>> > +       /* it is already set, and not inherited from the parent, reject */
>> > +       ccontainerid = audit_get_containerid(task);
>> > +       rcu_read_lock();
>> > +       parent = rcu_dereference(task->real_parent);
>> > +       rcu_read_unlock();
>> > +       task_lock(parent);
>> > +       pcontainerid = audit_get_containerid(parent);
>> > +       task_unlock(parent);
>> > +       if (ccontainerid != pcontainerid)
>> > +               return -EPERM;
>> > +       return 0;
>> > +}
>> > +
>> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
>> > +                                     u64 containerid, int rc)
>> > +{
>> > +       struct audit_buffer *ab;
>> > +       uid_t uid;
>> > +       struct tty_struct *tty;
>> > +
>> > +       if (!audit_enabled)
>> > +               return;
>> > +
>> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
>> > +       if (!ab)
>> > +               return;
>> > +
>> > +       uid = from_kuid(&init_user_ns, task_uid(current));
>> > +       tty = audit_get_tty(current);
>> > +
>> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
>> > +       audit_log_task_context(ab);
>> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
>> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
>> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
>> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
>> > +
>> > +       audit_put_tty(tty);
>> > +       audit_log_end(ab);
>> > +}
>> > +
>> > +/**
>> > + * audit_set_containerid - set current task's audit_context containerid
>> > + * @containerid: containerid value
>> > + *
>> > + * Returns 0 on success, -EPERM on permission failure.
>> > + *
>> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
>> > + */
>> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
>> > +{
>> > +       u64 oldcontainerid;
>> > +       int rc;
>> > +
>> > +       oldcontainerid = audit_get_containerid(task);
>> > +
>> > +       rc = audit_set_containerid_perm(task, containerid);
>> > +       if (!rc) {
>> > +               task_lock(task);
>> > +               task->containerid = containerid;
>> > +               task_unlock(task);
>> > +       }
>> > +
>> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
>> > +       return rc;
>>
>> Why are audit_set_containerid_perm() and audit_log_containerid()
>> separate functions?
>
> (I assume you mean audit_log_set_containerid()?)

Yep.  My fingers got tired typing in that function name and decided a
shortcut was necessary.

> It seemed clearer that all the permission checking was in one function
> and its return code could be used to report the outcome when logging the
> (attempted) action.  This is the same structure as audit_set_loginuid()
> and it made sense.

When possible I really like it when the permission checks are in the
same function as the code which does the work; it's less likely to get
abused that way (you have to willfully bypass the access checks).  The
exceptions might be if you wanted to reuse the access control code, or
insert a modular access mechanism (e.g. LSMs).

I'm less concerned about audit_log_set_containerid(), but the usual
idea of avoiding single-use function within the same scope applies
here.

> This would be the time to connect it to a syscall if that seems like a
> good idea and remove pid, uid, auid, tty, ses fields.

Ah yes, I missed that.  You know my stance on connecting records by
now (hint: yes, connect them) so I think that would be a good thing to
do for the next round.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: dev_loopback_xmit parameters
From: Emanuele @ 2018-04-23 23:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <c91f52c8-c7de-e555-a184-b62d39665b9f@gmail.com>

Ok, clear now.

Even though I don't understand what to set to avoid triggering the
WARN_ON(!skb_dst(skb));
inside dev_loopback_xmit.
I just would like to send the skb in loopback, i.e. moving the packet 
from the sending to the receiving queue of a certain struct net_device.


On 24/04/2018 00:36, Eric Dumazet wrote:
>
> On 04/23/2018 02:40 PM, Emanuele wrote:
>> Hello,
>>
>> I don't know if this is the right place where to ask, but I was wondering why the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * and struct sock *  as parameters. They are never used, so I believe passing only the struct  sk_buff * should be enough.
>>
> Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit().
>
>> In addition, it would like to know where I can read what is and how to set a skb dst_entry, since I don't really understand it.
>>
>> Thanks a lot,
>>
>> Emanuele
>>

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Willem de Bruijn @ 2018-04-23 23:04 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <20180423135619.7179-3-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> In this commit the base structure of the AF_XDP address family is set
> up. Further, we introduce the abilty register a window of user memory
> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
> window is viewed by an AF_XDP socket as a set of equally large
> frames. After a user memory registration all frames are "owned" by the
> user application, and not the kernel.
>
> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +       struct task_struct *task;
> +       struct mm_struct *mm;
> +       unsigned long diff;
> +
> +       if (umem->pgs) {
> +               xdp_umem_unpin_pages(umem);
> +
> +               task = get_pid_task(umem->pid, PIDTYPE_PID);
> +               put_pid(umem->pid);
> +               if (!task)
> +                       goto out;
> +               mm = get_task_mm(task);
> +               put_task_struct(task);
> +               if (!mm)
> +                       goto out;
> +
> +               diff = umem->size >> PAGE_SHIFT;

Need to round up or size must always be a multiple of PAGE_SIZE.

> +
> +               down_write(&mm->mmap_sem);
> +               mm->pinned_vm -= diff;
> +               up_write(&mm->mmap_sem);

When using user->locked_vm for resource limit checks, no need
to also update mm->pinned_vm?

> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +       u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +       u64 addr = mr->addr, size = mr->len;
> +       unsigned int nframes;
> +       int size_chk, err;
> +
> +       if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +               /* Strictly speaking we could support this, if:
> +                * - huge pages, or*
> +                * - using an IOMMU, or
> +                * - making sure the memory area is consecutive
> +                * but for now, we simply say "computer says no".
> +                */
> +               return -EINVAL;
> +       }

Ideally, AF_XDP subsumes all packet socket use cases. It does not
have packet v3's small packet optimizations of variable sized frames
and block signaling.

I don't suggest adding that now. But for the non-zerocopy case, it may
make sense to ensure that nothing is blocking a later addition of these
features. Especially for header-only (snaplen) workloads. So far, I don't
see any issues.

> +       if (!is_power_of_2(frame_size))
> +               return -EINVAL;
> +
> +       if (!PAGE_ALIGNED(addr)) {
> +               /* Memory area has to be page size aligned. For
> +                * simplicity, this might change.
> +                */
> +               return -EINVAL;
> +       }
> +
> +       if ((addr + size) < addr)
> +               return -EINVAL;
> +
> +       nframes = size / frame_size;
> +       if (nframes == 0 || nframes > UINT_MAX)
> +               return -EINVAL;

You may also want a check here that nframes * frame_size is at least
PAGE_SIZE and probably a multiple of that.

> +       frame_headroom = ALIGN(frame_headroom, 64);
> +
> +       size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +       if (size_chk < 0)
> +               return -EINVAL;
> +
> +       umem->pid = get_task_pid(current, PIDTYPE_PID);
> +       umem->size = (size_t)size;
> +       umem->address = (unsigned long)addr;
> +       umem->props.frame_size = frame_size;
> +       umem->props.nframes = nframes;
> +       umem->frame_headroom = frame_headroom;
> +       umem->npgs = size / PAGE_SIZE;
> +       umem->pgs = NULL;
> +       umem->user = NULL;
> +
> +       umem->frame_size_log2 = ilog2(frame_size);
> +       umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
> +       umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
> +       atomic_set(&umem->users, 1);
> +
> +       err = xdp_umem_account_pages(umem);
> +       if (err)
> +               goto out;
> +
> +       err = xdp_umem_pin_pages(umem);
> +       if (err)

need to call xdp_umem_unaccount_pages on error
> +               goto out;
> +       return 0;
> +
> +out:
> +       put_pid(umem->pid);
> +       return err;
> +}

^ permalink raw reply

* [PATCH v2] selftests: bpf: update .gitignore with missing file
From: Anders Roxell @ 2018-04-23 22:53 UTC (permalink / raw)
  To: ast, daniel, shuah; +Cc: netdev, linux-kernel, linux-kselftest, Anders Roxell
In-Reply-To: <065a8bc2-6065-3d43-8b03-35ed693de7ce@iogearbox.net>

Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
Rebased against bpf-next.

 tools/testing/selftests/bpf/.gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 9cf83f895d98..da19f0562bf8 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -12,3 +12,4 @@ test_tcpbpf_user
 test_verifier_log
 feature
 test_libbpf_open
+test_btf
-- 
2.17.0

^ permalink raw reply related

* Re: [bpf PATCH v3 0/3] BPF, a couple sockmap fixes
From: Daniel Borkmann @ 2018-04-23 22:52 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20180423223712.16388.63625.stgit@john-Precision-Tower-5810>

On 04/24/2018 12:39 AM, John Fastabend wrote:
> While testing sockmap with more programs (besides our test programs)
> I found a couple issues.
> 
> The attached series fixes an issue where pinned maps were not
> working correctly, blocking sockets returned zero, and an error
> path that when the sock hit an out of memory case resulted in a
> double page_put() while doing ingress redirects.
> 
> See individual patches for more details.
> 
> v2: Incorporated Daniel's feedback to use map ops for uref put op
>     which also fixed the build error discovered in v1.
> v3: rename map_put_uref to map_release_uref

Applied to bpf tree, thanks John!

^ permalink raw reply

* [PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters
From: Florian Fainelli @ 2018-04-23 22:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, vidya.chowdary, dustin, roopa, Florian Fainelli

While adding support for ethtool::get_fecparam and set_fecparam, kernel
doc for these functions was missed, add those.

Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v2:

- corrected set_fecparam in commit message

 include/linux/ethtool.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..b32cd2062f18 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,8 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
  *	fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  *	instead of the latter), any change to them will be overwritten
  *	by kernel. Returns a negative error code or zero.
+ * @get_fecparam: Get the network device Forward Error Correction parameters.
+ * @set_fecparam: Set the network device Forward Error Correction parameters.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
-- 
2.14.1

^ permalink raw reply related

* [bpf PATCH v3 3/3] bpf: sockmap, fix double page_put on ENOMEM error in redirect path
From: John Fastabend @ 2018-04-23 22:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423223712.16388.63625.stgit@john-Precision-Tower-5810>

In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.

To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.

Found this while running TCP_STREAM test with netperf using Cilium.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index aaf50ec..634415c 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
 	i = md->sg_start;
 
 	do {
-		r->sg_data[i] = md->sg_data[i];
-
 		size = (apply && apply_bytes < md->sg_data[i].length) ?
 			apply_bytes : md->sg_data[i].length;
 
@@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
 		}
 
 		sk_mem_charge(sk, size);
+		r->sg_data[i] = md->sg_data[i];
 		r->sg_data[i].length = size;
 		md->sg_data[i].length -= size;
 		md->sg_data[i].offset += size;

^ permalink raw reply related

* [bpf PATCH v3 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases
From: John Fastabend @ 2018-04-23 22:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423223712.16388.63625.stgit@john-Precision-Tower-5810>

In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a73d484..aaf50ec 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -43,6 +43,7 @@
 #include <net/tcp.h>
 #include <linux/ptr_ring.h>
 #include <net/inet_common.h>
+#include <linux/sched/signal.h>
 
 #define SOCK_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 	return err;
 }
 
+static int bpf_wait_data(struct sock *sk,
+			 struct smap_psock *psk, int flags,
+			 long timeo, int *err)
+{
+	int rc;
+
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+	add_wait_queue(sk_sleep(sk), &wait);
+	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	rc = sk_wait_event(sk, &timeo,
+			   !list_empty(&psk->ingress) ||
+			   !skb_queue_empty(&sk->sk_receive_queue),
+			   &wait);
+	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	remove_wait_queue(sk_sleep(sk), &wait);
+
+	return rc;
+}
+
 static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			   int nonblock, int flags, int *addr_len)
 {
@@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
 	lock_sock(sk);
+bytes_ready:
 	while (copied != len) {
 		struct scatterlist *sg;
 		struct sk_msg_buff *md;
@@ -809,6 +831,28 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
+	if (!copied) {
+		long timeo;
+		int data;
+		int err = 0;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = bpf_wait_data(sk, psock, flags, timeo, &err);
+
+		if (data) {
+			if (!skb_queue_empty(&sk->sk_receive_queue)) {
+				release_sock(sk);
+				smap_release_sock(psock, sk);
+				copied = tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+				return copied;
+			}
+			goto bytes_ready;
+		}
+
+		if (err)
+			copied = err;
+	}
+
 	release_sock(sk);
 	smap_release_sock(psock, sk);
 	return copied;

^ permalink raw reply related

* [bpf PATCH v3 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps
From: John Fastabend @ 2018-04-23 22:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423223712.16388.63625.stgit@john-Precision-Tower-5810>

Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 0 files changed

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ac4340..a5782d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -33,6 +33,7 @@ struct bpf_map_ops {
 	void (*map_release)(struct bpf_map *map, struct file *map_file);
 	void (*map_free)(struct bpf_map *map);
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+	void (*map_release_uref)(struct bpf_map *map);
 
 	/* funcs callable from userspace and from eBPF programs */
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
@@ -448,7 +449,6 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
 int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 				 void *key, void *value, u64 map_flags);
 int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
-void bpf_fd_array_map_clear(struct bpf_map *map);
 int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 				void *key, void *value, u64 map_flags);
 int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 02a1893..0fd8d8f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -526,7 +526,7 @@ static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 }
 
 /* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_fd_array_map_clear(struct bpf_map *map)
+static void bpf_fd_array_map_clear(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
@@ -545,6 +545,7 @@ void bpf_fd_array_map_clear(struct bpf_map *map)
 	.map_fd_get_ptr = prog_fd_array_get_ptr,
 	.map_fd_put_ptr = prog_fd_array_put_ptr,
 	.map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem,
+	.map_release_uref = bpf_fd_array_map_clear,
 };
 
 static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a3b2138..a73d484 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1831,7 +1831,7 @@ static int sock_map_update_elem(struct bpf_map *map,
 	return err;
 }
 
-static void sock_map_release(struct bpf_map *map, struct file *map_file)
+static void sock_map_release(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct bpf_prog *orig;
@@ -1855,7 +1855,7 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
 	.map_get_next_key = sock_map_get_next_key,
 	.map_update_elem = sock_map_update_elem,
 	.map_delete_elem = sock_map_delete_elem,
-	.map_release = sock_map_release,
+	.map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fe23dc5a..a22c26e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -260,8 +260,8 @@ static void bpf_map_free_deferred(struct work_struct *work)
 static void bpf_map_put_uref(struct bpf_map *map)
 {
 	if (atomic_dec_and_test(&map->usercnt)) {
-		if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
-			bpf_fd_array_map_clear(map);
+		if (map->ops->map_release_uref)
+			map->ops->map_release_uref(map);
 	}
 }
 

^ permalink raw reply related

* [bpf PATCH v3 0/3] BPF, a couple sockmap fixes
From: John Fastabend @ 2018-04-23 22:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

While testing sockmap with more programs (besides our test programs)
I found a couple issues.

The attached series fixes an issue where pinned maps were not
working correctly, blocking sockets returned zero, and an error
path that when the sock hit an out of memory case resulted in a
double page_put() while doing ingress redirects.

See individual patches for more details.

v2: Incorporated Daniel's feedback to use map ops for uref put op
    which also fixed the build error discovered in v1.
v3: rename map_put_uref to map_release_uref

---

John Fastabend (3):
      bpf: sockmap, map_release does not hold refcnt for pinned maps
      bpf: sockmap, sk_wait_event needed to handle blocking cases
      bpf: sockmap, fix double page_put on ENOMEM error in redirect path


 0 files changed

^ permalink raw reply

* Re: dev_loopback_xmit parameters
From: Eric Dumazet @ 2018-04-23 22:36 UTC (permalink / raw)
  To: Emanuele, Linux Kernel Network Developers
In-Reply-To: <eb1bf67b-7f38-a3b5-4974-6ee1d198887f@gmail.com>



On 04/23/2018 02:40 PM, Emanuele wrote:
> Hello,
> 
> I don't know if this is the right place where to ask, but I was wondering why the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * and struct sock *  as parameters. They are never used, so I believe passing only the struct  sk_buff * should be enough.
> 

Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit().

> In addition, it would like to know where I can read what is and how to set a skb dst_entry, since I don't really understand it.
> 
> Thanks a lot,
> 
> Emanuele
> 

^ permalink raw reply

* Re: [PATCH] selftests: bpf: update .gitignore with missing file
From: Daniel Borkmann @ 2018-04-23 22:31 UTC (permalink / raw)
  To: Anders Roxell
  Cc: ast, Shuah Khan, netdev, Linux Kernel Mailing List,
	linux-kselftest
In-Reply-To: <CADYN=9KAorrGBh+2_LZ+6cxaf4yU-pOmU4BSzkMKT8sdU7hhUw@mail.gmail.com>

On 04/24/2018 12:14 AM, Anders Roxell wrote:
> On 23 April 2018 at 23:34, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 04/23/2018 03:50 PM, Anders Roxell wrote:
>>> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
>>> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
>>> ---
>>>  tools/testing/selftests/bpf/.gitignore | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
>>> index 5e1ab2f0eb79..3e3b3ced3f7c 100644
>>> --- a/tools/testing/selftests/bpf/.gitignore
>>> +++ b/tools/testing/selftests/bpf/.gitignore
>>> @@ -15,3 +15,4 @@ test_libbpf_open
>>>  test_sock
>>>  test_sock_addr
>>>  urandom_read
>>> +test_btf
>>
>> Against which tree is this? This doesn't apply to bpf-next. It would
>> apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests")
>> is part of bpf-next, so fits to neither of them.
> 
> I'm sorry,
> 
> I did it against this patch [1] that I thought was already applied to
> the bpf tree.

That was bpf tree since the original change already went to mainline; the
BTF is in bpf-next however, so you need to rebase your change against that.

Thanks,
Daniel

^ permalink raw reply

* [PATCH 4/4] rhashtable: improve rhashtable_walk stability when stop/start used.
From: NeilBrown @ 2018-04-23 22:29 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <152452244405.1456.8175298512483573078.stgit@noble>

When a walk of an rhashtable is interrupted with rhastable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |   44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 81edf1ab38ab..9427b5766134 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -727,6 +727,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 	__acquires(RCU)
 {
 	struct rhashtable *ht = iter->ht;
+	bool rhlist = ht->rhlist;
 
 	rcu_read_lock();
 
@@ -735,13 +736,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 		list_del(&iter->walker.list);
 	spin_unlock(&ht->lock);
 
-	if (!iter->walker.tbl && !iter->end_of_table) {
+	if (iter->end_of_table)
+		return 0;
+	if (!iter->walker.tbl) {
 		iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
 		iter->slot = 0;
 		iter->skip = 0;
 		return -EAGAIN;
 	}
 
+	if (iter->p && !rhlist) {
+		/*
+		 * We need to validate that 'p' is still in the table, and
+		 * if so, update 'skip'
+		 */
+		struct rhash_head *p;
+		int skip = 0;
+		rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+			skip++;
+			if (p == iter->p) {
+				iter->skip = skip;
+				goto found;
+			}
+		}
+		iter->p = NULL;
+	} else if (iter->p && rhlist) {
+		/* Need to validate that 'list' is still in the table, and
+		 * if so, update 'skip' and 'p'.
+		 */
+		struct rhash_head *p;
+		struct rhlist_head *list;
+		int skip = 0;
+		rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+			for (list = container_of(p, struct rhlist_head, rhead);
+			     list;
+			     list = rcu_dereference(list->next)) {
+				skip++;
+				if (list == iter->list) {
+					iter->p = p;
+					skip = skip;
+					goto found;
+				}
+			}
+		}
+		iter->p = NULL;
+	}
+found:
 	return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -917,8 +957,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 		iter->walker.tbl = NULL;
 	spin_unlock(&ht->lock);
 
-	iter->p = NULL;
-
 out:
 	rcu_read_unlock();
 }

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox