From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
Date: Thu, 12 Apr 2018 10:54:31 +0200
Message-ID: <20180412105431.41bd8d0a@redhat.com>
References: <20180327165919.17933-1-bjorn.topel@gmail.com>
 <20180327165919.17933-4-bjorn.topel@gmail.com>
 <20180412050542-mutt-send-email-mst@kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Cc: "Michael S. Tsirkin", Björn Töpel, "Duyck, Alexander H",
 alexander.duyck@gmail.com, john.fastabend@gmail.com, ast@fb.com,
 willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
 netdev@vger.kernel.org, michael.lundkvist@ericsson.com,
 "Brandeburg, Jesse", "Singhai, Anjali", "Zhang, Qi Z",
 ravineet.singh@ericsson.com, brouer@redhat.com
To: "Karlsson, Magnus"
Return-path:
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:58300 "EHLO
 mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
 id S1750780AbeDLIyi (ORCPT); Thu, 12 Apr 2018 04:54:38 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, 12 Apr 2018 07:38:25 +0000
"Karlsson, Magnus" wrote:

> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:16 AM
> > To: Björn Töpel
> > Cc: Karlsson, Magnus; Duyck, Alexander H;
> > alexander.duyck@gmail.com; john.fastabend@gmail.com; ast@fb.com;
> > brouer@redhat.com; willemdebruijn.kernel@gmail.com;
> > daniel@iogearbox.net; netdev@vger.kernel.org;
> > michael.lundkvist@ericsson.com; Brandeburg, Jesse; Singhai, Anjali;
> > Zhang, Qi Z; ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> > >  	__u32 frame_headroom; /* Frame head room */
> > >  };
> > >
> > > +/* Pgoff for mmaping the rings */
> > > +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
> > > +
> > > +struct xdp_queue {
> > > +	__u32 head_idx __attribute__((aligned(64)));
> > > +	__u32 tail_idx __attribute__((aligned(64)));
> > > +};
> > > +
> > > +/* Used for the fill and completion queues for buffers */
> > > +struct xdp_umem_queue {
> > > +	struct xdp_queue ptrs;
> > > +	__u32 desc[0] __attribute__((aligned(64)));
> > > +};
> > > +
> > >  #endif /* _LINUX_IF_XDP_H */
> >
> > So IIUC it's a head/tail ring of 32 bit descriptors.
> >
> > In my experience (from implementing ptr_ring) this implies that
> > head/tail cache lines bounce a lot between CPUs. Caching will help
> > some. You are also forced to use barriers to check validity which is
> > slow on some architectures.
> >
> > If instead you can use a special descriptor value (e.g.
> > 0) as a valid signal, things work much better:
> >
> > - you read descriptor atomically, if it's not 0 it's fine
> > - same with write - write 0 to pass it to the other side
> > - there is a data dependency so no need for barriers (except on dec alpha)
> > - no need for power of 2 limitations, you can make it any size you like
> > - easy to resize too
> >
> > architecture (if not implementation) would be shared with ptr_ring so
> > some of the optimization ideas like batched updates could be lifted
> > from there.
> >
> > When I was building ptr_ring, any head/tail design underperformed
> > storing valid flag with data itself. YMMV.

I fully agree with MST here. This is also my experience. I even dropped
my own Array-based Lock-Free (ALF) queue implementation[1] in favor of
ptr_ring. (There I tried to amortize this cost by bulking, but that
causes the queue to become non-wait-free.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from a little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.
>
> Anyway, I will take a look at ptr_ring and run some experiments
> along the lines of what you propose to get some
> numbers. Considering your experience with these kinds of
> structures, you are likely right. I just need to convince
> myself :-).

When benchmarking, be careful that you don't measure the "wrong" queue
situation.
When doing this kind of "overload" benchmarking, you will likely create
a situation where the queue is always full (which hopefully isn't a
production use-case). In the almost/always-full queue situation, using
the element values to sync on (as MST proposes) will still cause the
cache-line bouncing (that we want to avoid). MST explains, and has
addressed, this situation for ptr_ring in:

 commit fb9de9704775 ("ptr_ring: batch ring zeroing")
 https://git.kernel.org/torvalds/c/fb9de9704775

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer