From: John Fastabend
To: Jesper Dangaard Brouer, "netdev@vger.kernel.org"
Cc: "Michael S. Tsirkin"
Subject: Re: [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
Date: Tue, 15 Nov 2016 20:37:20 -0800
Message-ID: <582BE280.7030306@gmail.com>
In-Reply-To: <20161115143258.2c46fc9a@redhat.com>
References: <20161115143258.2c46fc9a@redhat.com>

On 16-11-15 05:32 AM, Jesper Dangaard Brouer wrote:
>
> (looks like my message didn't reach the netdev list, due to me sending
> from the wrong email, forwarded message again):
>
> On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend wrote:
>
>> ---
>>  include/linux/ptr_ring_ll.h |  136 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/skb_array.h   |   25 ++++++++
>>  2 files changed, 161 insertions(+)
>>  create mode 100644 include/linux/ptr_ring_ll.h
>>
>> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
>> new file mode 100644
>> index 0000000..bcb11f3
>> --- /dev/null
>> +++ b/include/linux/ptr_ring_ll.h
>> @@ -0,0 +1,136 @@
>> +/*
>> + * Definitions for the 'struct ptr_ring_ll' datastructure.
>> + *
>> + * Author:
>> + *   John Fastabend
> [...]
>> + *
>> + * This is a limited-size FIFO maintaining pointers in FIFO order, with
>> + * one CPU producing entries and another consuming entries from a FIFO.
>> + * extended from ptr_ring_ll to use cmpxchg over spin lock.
>
> It sounds like this is a Single Producer Single Consumer (SPSC)
> implementation, but your implementation is actually Multi Producer
> Multi Consumer (MPMC) capable.

Correct. The qdisc case requires an MPMC queue to handle all the OOO cases.

>
> The implementation looks a lot like my alf_queue[1] implementation:
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
>

Sure, I was using that implementation originally.

> If the primary use-case is one CPU producing and another consuming,
> then the normal ptr_ring (skb_array) will actually be faster!
>
> The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
> every ring access. This is achieved by having the checks for full
> (__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
> array (NULL value).
>
> I actually implemented two micro-benchmarks to measure the difference
> between skb_array[2] and alf_queue[3]:
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
>

But :) this doesn't jibe with my experiments, where this implementation
was actually giving better numbers with pktgen over pfifo_fast even in
the SPSC case. I'll rerun the metrics later this week; it's possible
there was some other issue causing the difference.

As I noted in the reply to Michael, though, I really need to fix a bug
in my qdisc code and submit that before I worry too much about this
optimization.
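
Just to make sure I'm reading you right, the property you mean is
roughly the following. This is a simplified sketch from memory, not the
actual __ptr_ring_full()/__ptr_ring_empty() code, and the helper names
are made up for illustration:

/* ptr_ring-style checks: each side inspects only the slot it is about
 * to use, so the producer never reads the consumer index and vice
 * versa, and the index cachelines do not bounce between CPUs.
 */
static inline bool example_slot_full(void **queue, u32 idx)
{
        return queue[idx] != NULL;      /* previous entry not yet consumed */
}

static inline bool example_slot_empty(void **queue, u32 idx)
{
        return queue[idx] == NULL;      /* nothing produced here yet */
}

whereas in my ring both produce and consume read the shared prod/cons
indexes, so those cachelines ping-pong. I'll keep that in mind when I
rerun the numbers.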
>
>> + */
>> +
>> +#ifndef _LINUX_PTR_RING_LL_H
>> +#define _LINUX_PTR_RING_LL_H 1
>> +
> [...]
>> +
>> +struct ptr_ring_ll {
>> +        u32 prod_size;
>> +        u32 prod_mask;
>> +        u32 prod_head;
>> +        u32 prod_tail;
>> +        u32 cons_size;
>> +        u32 cons_mask;
>> +        u32 cons_head;
>> +        u32 cons_tail;
>> +
>> +        void **queue;
>> +};
>
> Your implementation doesn't even split the consumer and producer into
> different cachelines (which in practice doesn't help much due to how
> the empty/full checks are performed).

It was just an implementation to get the qdisc patches off the ground. I
expected to follow up with patches to optimize the implementation.

[...]

>> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
>> +{
>> +        r->queue = __ptr_ring_init_queue_alloc(size, gfp);
>> +        if (!r->queue)
>> +                return -ENOMEM;
>> +
>> +        r->prod_size = r->cons_size = size;
>> +        r->prod_mask = r->cons_mask = size - 1;
>
> Shouldn't we have some check like is_power_of_2(size), as this code
> looks like it depends on it.
>

Sure, it is required. I was just trusting callers to do it correctly.

>> +        r->prod_tail = r->prod_head = 0;
>> +        r->cons_tail = r->prod_tail = 0;
>> +
>> +        return 0;
>> +}
>> +
[...]
>>
>> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
>> +{
>> +        return __ptr_ring_ll_consume(&a->ring);
>> +}
>> +
>
> Note in the Multi Producer Multi Consumer (MPMC) use-case this type of
> queue can be faster than normal ptr_ring. And in patch2 you implement
> bulking, which is where the real benefit shows (in the MPMC case) for
> this kind of queue.
>
> What I would really like to see is a lock-free (locked cmpxchg) queue
> implementation that, like ptr_ring, uses the array as the empty/full
> check, and still (somehow) supports bulking.
>

OK, perhaps worth experimenting with once I can _finally_ get the qdisc
series in.

.John
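
P.S. For the queue you describe in that last paragraph -- cmpxchg based,
but using the array slot itself as the full check the way ptr_ring
does -- a first cut could look something like the sketch below. This is
untested, the function name is made up, bulking is not handled, and the
memory barriers are omitted:

static inline int example_ll_produce(struct ptr_ring_ll *r, void *ptr)
{
        u32 head = READ_ONCE(r->prod_head);
        u32 index = head & r->prod_mask;

        /* Full check against the slot contents (ptr_ring style) rather
         * than against the consumer index, so the consumer's cacheline
         * is not touched here.
         */
        if (READ_ONCE(r->queue[index]))
                return -ENOSPC;

        /* Claim the slot among competing producers; the caller can
         * retry or back off if another producer won the race.
         */
        if (cmpxchg(&r->prod_head, head, head + 1) != head)
                return -EAGAIN;

        /* We now exclusively own queue[index]; publish the pointer. */
        WRITE_ONCE(r->queue[index], ptr);
        return 0;
}

The "(somehow) support bulking" part is the piece I don't have a good
answer for with this scheme yet.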