netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] illustrate cmpxchg ring for tap/tun and qdisc
@ 2016-11-11  4:43 John Fastabend
  2016-11-11  4:44 ` [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings John Fastabend
  2016-11-11  4:44 ` [RFC PATCH 2/2] ptr_ring_ll: pop/push multiple objects at once John Fastabend
  0 siblings, 2 replies; 12+ messages in thread
From: John Fastabend @ 2016-11-11  4:43 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, linux-kernel

This is a cmpxchg ring that could potentially be used to replace the spinlock
variant of skb_array.

Couple comments its missing a resize operator which would need to be added
before it could be a drop in replacement. And although I tested the first
patch with my latest set of qdisc updates the second patch I have not tested
for a bit so it may or may not work but it illustrates the idea at least.

This was derived from the DPDK documentation where there is a description
of the cmpxchg ring presumably used there. It may or may not actually align
with what is done in DPDK implementation I have no idea I didn't look.


---

John Fastabend (2):
      net: use cmpxchg instead of spinlock in ptr rings
      ptr_ring_ll: pop/push multiple objects at once


 include/linux/netdevice.h   |   12 ----
 include/linux/ptr_ring_ll.h |  146 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/skb_array.h   |   32 +++++++++
 net/core/dev.c              |   18 +----
 net/sched/sch_generic.c     |   86 +++++++++++++------------
 5 files changed, 224 insertions(+), 70 deletions(-)
 create mode 100644 include/linux/ptr_ring_ll.h

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
@ 2016-11-15 13:32 Jesper Dangaard Brouer
  2016-11-15 14:30 ` Michael S. Tsirkin
  2016-11-16  4:37 ` John Fastabend
  0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-15 13:32 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: brouer, John Fastabend, Michael S. Tsirkin


(looks like my message didn't reach the netdev list, due to me sending
from the wrong email, forwarded message again):

On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> ---
>  include/linux/ptr_ring_ll.h |  136 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/skb_array.h   |   25 ++++++++
>  2 files changed, 161 insertions(+)
>  create mode 100644 include/linux/ptr_ring_ll.h
> 
> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
> new file mode 100644
> index 0000000..bcb11f3
> --- /dev/null
> +++ b/include/linux/ptr_ring_ll.h
> @@ -0,0 +1,136 @@
> +/*
> + *	Definitions for the 'struct ptr_ring_ll' datastructure.
> + *
> + *	Author:
> + *		John Fastabend <john.r.fastabend@intel.com>  
[...]
> + *
> + *	This is a limited-size FIFO maintaining pointers in FIFO order, with
> + *	one CPU producing entries and another consuming entries from a FIFO.
> + *	extended from ptr_ring_ll to use cmpxchg over spin lock.  

It sounds like this is Single Producer Single Consumer (SPSC)
implementation, but your implementation actually is Multi Producer
Multi Consumer (MPMC) capable.

The implementation looks a lot like my alf_queue[1] implementation:
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

If the primary use-case is one CPU producing and another consuming,
then the normal ptr_ring (skb_array) will actually be faster!

The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
every ring access.  This is achieved by having the checks for full
(__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
array (NULL value).

I actually implemented two micro-benchmarks to measure the difference
between skb_array[2] and alf_queue[3]:
 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
 [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c


> + */
> +
> +#ifndef _LINUX_PTR_RING_LL_H
> +#define _LINUX_PTR_RING_LL_H 1
> +  
[...]
> +
> +struct ptr_ring_ll {
> +	u32 prod_size;
> +	u32 prod_mask;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_size;
> +	u32 cons_mask;
> +	u32 cons_head;
> +	u32 cons_tail;
> +
> +	void **queue;
> +};  

Your implementation doesn't even split the consumer and producer into
different cachelines (which in practice doesn't help much due to how
the empty/full checks are performed).

> +
> +/* Note: callers invoking this in a loop must use a compiler barrier,
> + * for example cpu_relax(). Callers must hold producer_lock.
> + */
> +static inline int __ptr_ring_ll_produce(struct ptr_ring_ll *r, void *ptr)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +
> +	do {
> +		head = READ_ONCE(r->prod_head);
> +		mask = READ_ONCE(r->prod_mask);
> +		tail = READ_ONCE(r->cons_tail);  

Problem occur here, as the producer need to access/read the consumers
tail, to determine if the queue is not already full (slots avail).
Thus, the next "consumer-CPU" will see the cacheline in wrong state
(Modified/Invalid or Shared).

> +
> +		slots = mask + tail - head;
> +		if (slots < 1)
> +			return -ENOMEM;
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->prod_head, head, next);
> +	} while (ret != head);
> +
> +	r->queue[head & mask] = ptr;
> +	smp_wmb();
> +
> +	while (r->prod_tail != head)
> +		cpu_relax();
> +
> +	r->prod_tail = next;
> +	return 0;
> +}
> +
> +static inline void *__ptr_ring_ll_consume(struct ptr_ring_ll *r)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +	void *ptr;
> +
> +	do {
> +		head = READ_ONCE(r->cons_head);
> +		mask = READ_ONCE(r->cons_mask);
> +		tail = READ_ONCE(r->prod_tail);  

Like wise the consumer is reading the producer tail (for the empty check).

> +
> +		slots = tail - head;
> +		if (slots < 1)
> +			return ERR_PTR(-ENOMEM);
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->cons_head, head, next);
> +	} while (ret != head);
> +
> +	ptr = r->queue[head & mask];
> +	smp_rmb();
> +
> +	while (r->cons_tail != head)
> +		cpu_relax();
> +
> +	r->cons_tail = next;
> +	return ptr;
> +}
> +
> +static inline void **__ptr_ring_ll_init_queue_alloc(int size, gfp_t gfp)
> +{
> +	return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
> +}
> +
> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
> +{
> +	r->queue = __ptr_ring_init_queue_alloc(size, gfp);
> +	if (!r->queue)
> +		return -ENOMEM;
> +
> +	r->prod_size = r->cons_size = size;
> +	r->prod_mask = r->cons_mask = size - 1;  

Shouldn't we have some check like is_power_of_2(size), as this code
looks like it depend on this.

> +	r->prod_tail = r->prod_head = 0;
> +	r->cons_tail = r->prod_tail = 0;
> +
> +	return 0;
> +}
> +  
[...]
> +#endif /* _LINUX_PTR_RING_LL_H  */
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> index f4dfade..9b43dfd 100644
> --- a/include/linux/skb_array.h
> +++ b/include/linux/skb_array.h  
[...]
>  
> +static inline int skb_array_ll_produce(struct skb_array_ll *a, struct sk_buff *skb)
> +{
> +	return __ptr_ring_ll_produce(&a->ring, skb);
> +}
> +  
[...]
>  
> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
> +{
> +	return __ptr_ring_ll_consume(&a->ring);
> +}
> +  

Note in the Multi Producer Multi Consumer (MPMC) use-case this type of
queue can be faster than normal ptr_ring.  And in patch2 you implement
bulking, which is where the real benefit shows (in the MPMC case) for
this kind of queue.

What I would really like to see is a lock-free (locked cmpxchg) queue
implementation, what like ptr_ring use the array as empty/full check,
and still (somehow) support bulking.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-11-16  5:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-11  4:43 [RFC PATCH 0/2] illustrate cmpxchg ring for tap/tun and qdisc John Fastabend
2016-11-11  4:44 ` [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings John Fastabend
2016-11-14 11:09   ` Jesper Dangaard Brouer
2016-11-14 23:01   ` Michael S. Tsirkin
2016-11-16  4:30     ` John Fastabend
2016-11-11  4:44 ` [RFC PATCH 2/2] ptr_ring_ll: pop/push multiple objects at once John Fastabend
2016-11-14 23:06   ` Michael S. Tsirkin
2016-11-16  4:42     ` John Fastabend
2016-11-16  5:23       ` Michael S. Tsirkin
  -- strict thread matches above, loose matches on Subject: below --
2016-11-15 13:32 [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings Jesper Dangaard Brouer
2016-11-15 14:30 ` Michael S. Tsirkin
2016-11-16  4:37 ` John Fastabend

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).