From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
Date: Mon, 23 Apr 2018 10:52:03 +0200
Message-ID: <20180423105203.53600545@redhat.com>
References: <890db004-4dfe-7f77-61ee-1ac0d7d2a24c@gmail.com>
 <1524071712.2599.60.camel@redhat.com>
 <3270c995-4eea-b3e1-128c-82921d89eb79@gmail.com>
 <1524123637.3160.16.camel@redhat.com>
 <0e3abeb5-8081-f9ea-4de6-cc1a7edfc5a5@gmail.com>
 <20180420154836.3690a39e@redhat.com>
 <1524396178.10317.18.camel@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Eric Dumazet, Willem de Bruijn, netdev@vger.kernel.org,
 "David S. Miller", Tariq Toukan, brouer@redhat.com
To: Paolo Abeni
Return-path:
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:58784 "EHLO
 mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
 with ESMTP id S1754292AbeDWIwI (ORCPT );
 Mon, 23 Apr 2018 04:52:08 -0400
In-Reply-To: <1524396178.10317.18.camel@redhat.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Sun, 22 Apr 2018 13:22:58 +0200
Paolo Abeni wrote:

> On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet wrote:
> > > On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
> > > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> > > > [...]
> > > >
> > > > Any suggestions for better results are more than welcome!
> > >
> > > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > > Seoul (netdev conference). Not tied to UDP, but a generic solution.
> >
> > Yes, I remember. I think... was it the idea, where you basically
> > wanted to queue back SKBs to the CPU that allocated them, right?
> >
> > Freeing an SKB on the same CPU that allocated it, have multiple
> > advantages. (1) the SLUB allocator can use a non-atomic
> > "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> > the SKB stay local. (3) the atomic SKB refcnt/users stay local.
>
> By the time the skb is returned to the ingress cpu, isn't that skb most
> probably out of the cache?

This view is too simplistic. You have to look at the cache coherence
state[1] of the individual cache lines (an SKB consists of 4
cache-lines). And newer Intel CPUs [2] can "Forward(F)" cache-lines
between caches. The SKB cache-line that holds the atomic refcnt/users
is the important one to analyze (the Read For Ownership (RFO) case).
Analyzing the other cache-lines is actually more complicated due to
techniques like "Store Buffer" and "Invalidate Queues".

[1] https://en.wikipedia.org/wiki/MESI_protocol
[2] https://en.wikipedia.org/wiki/MESIF_protocol

There is also a lot of detail in point (1) about how the SLUB
allocator works internally, and how it avoids bouncing the struct-page
cache-line. Some of the performance benefit from your current patch
also comes from this...

> > We just have to avoid that queue back SKB's mechanism, doesn't cost
> > more than the operations we expect to save. Bulk transfer is an
> > obvious approach. For storing SKBs until they are returned, we already
> > have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> > which SLUB/SLAB-bulk free to amortize cost (1).
> >
> > I guess, the missing information is that we don't know what CPU the SKB
> > were created on...
> >
> > Where to store this CPU info?
> >
> > (a) In struct sk_buff, in a cache-line that is already read on remote
> > CPU in UDP code?
> >
> > (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> > basis, we could have SLUB store a hint about the CPU it was allocated
> > on, and bet on returning to that CPU ? (might be bad to read the
> > struct-page cache-line)
>
> Bulking would be doable only for connected sockets, elsewhere would be
> difficult to assemble a burst long enough to amortize the handshake
> with the remote CPU (spinlock + ipi needed ?!?)

We obviously need some level of bulking.

I would likely try to avoid any explicit IPI calls, and instead use a
queue like the ptr_ring queue, because it has good separation between
the cache-lines used by the consumer and the producer (but it might be
overkill for this use-case).

> Would be good enough for unconnected sockets sending a whole skb burst
> back to one of the (several) ingress CPU? e.g. peeking the CPU
> associated with the first skb inside the burst, we would somewhat
> balance the load between the ingress CPUs.

See Willem de Bruijn's suggestions...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
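
For illustration only, here is a minimal userspace sketch of the
producer/consumer cache-line separation mentioned above for a
ptr_ring-like queue, combined with a bulk drain on the allocating CPU.
It is not the kernel's ptr_ring API and not code from the patch under
discussion: the names ret_ring, ring_produce and ring_consume_bulk, the
capacity of 8 and the 64-byte cache-line size are all assumptions, and
it only handles one producer and one consumer.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8   /* hypothetical capacity */
#define CACHELINE 64  /* assumed cache-line size */

/* Hypothetical return queue: one remote producer, one local consumer. */
struct ret_ring {
	/* Written only by the producer (remote CPU). */
	alignas(CACHELINE) size_t producer;
	/* Written only by the consumer (allocating CPU). */
	alignas(CACHELINE) size_t consumer;
	/* Shared slots; a NULL slot means "empty". */
	alignas(CACHELINE) _Atomic(void *) slot[RING_SIZE];
};

static void ring_init(struct ret_ring *r)
{
	r->producer = 0;
	r->consumer = 0;
	for (size_t i = 0; i < RING_SIZE; i++)
		atomic_init(&r->slot[i], NULL);
}

/* Remote side: queue one object back. Returns 0, or -1 if the ring is full. */
static int ring_produce(struct ret_ring *r, void *obj)
{
	assert(obj != NULL); /* NULL is reserved as the "empty" marker */

	/* A non-NULL slot at the producer index means the consumer lags. */
	if (atomic_load_explicit(&r->slot[r->producer],
				 memory_order_acquire) != NULL)
		return -1;

	atomic_store_explicit(&r->slot[r->producer], obj,
			      memory_order_release);
	r->producer = (r->producer + 1) % RING_SIZE;
	return 0;
}

/* Local side: drain up to max objects in one pass. Returns the count. */
static size_t ring_consume_bulk(struct ret_ring *r, void **out, size_t max)
{
	size_t n = 0;

	while (n < max) {
		void *obj = atomic_load_explicit(&r->slot[r->consumer],
						 memory_order_acquire);
		if (obj == NULL)
			break; /* ring is empty */

		/* Hand the slot back to the producer. */
		atomic_store_explicit(&r->slot[r->consumer], NULL,
				      memory_order_release);
		r->consumer = (r->consumer + 1) % RING_SIZE;
		out[n++] = obj;
	}
	return n;
}
```

In this sketch neither side ever reads the other side's index; the only
cache-lines that move between CPUs are the ones holding the slots, and
the consumer pays for them once per drained burst rather than once per
object.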