From: Jesper Dangaard Brouer via iovisor-dev
Subject: Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
Date: Mon, 12 Sep 2016 10:56:55 +0200
Message-ID: <20160912105655.0cb5607e@redhat.com>
To: Alexei Starovoitov
Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
 Eric Dumazet, netdev, Edward Cree
In-Reply-To: <20160909063048.GA67375@ast-mbp.thefacebook.com>
References: <1473252152-11379-1-git-send-email-saeedm@mellanox.com>
 <1473252152-11379-12-git-send-email-saeedm@mellanox.com>
 <20160908101147.1b351432@redhat.com>
 <20160909032202.GA62966@ast-mbp.thefacebook.com>
 <20160909073652.351d76d7@redhat.com>
 <20160909063048.GA67375@ast-mbp.thefacebook.com>
List-Id: netdev.vger.kernel.org

On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> >
> > > > Lets do bundling/bulking from the start!
> > >
> > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > does bulking as well.
> > > See nothing wrong about it. RX side processes the packets and
> > > when it's done it tells TX to xmit whatever it collected.
> >
> > This is doing "hidden" bulking and not really taking advantage of
> > using the icache more efficiently.
> >
> > Let me explain the problem I see a little more clearly then, so you
> > hopefully see where I'm going.
> >
> > Imagine you have packets intermixed towards the stack and XDP_TX.
> > Every time you call the stack code, you flush your icache. When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX; this is a costly operation.
>
> correct. And why is that a problem?

It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance.  What I hear is that you are arguing
against a model that will yield better performance; that does not make
sense to me.  Let me explain this again, in another way.

This is a mental model switch.  Stop seeing the lowest driver RX as
something that works on a per-packet basis.  Maybe it is easier to
understand if we instead see this as vector processing?  This is about
having a vector of packets, to which we apply some action/operation.

This is about using the CPU more efficiently, getting it to execute
more instructions per cycle (which is directly measurable with perf,
while I-cache usage is not directly measurable).

Let's assume everything fits into the I-cache (XDP + driver code).
The CPU frontend still has to decode the instructions from the I-cache
into micro-ops.  The next level of optimization is to reuse the decoded
I-cache by running it on all elements in the packet-vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that
each hot code block should be less than about 500 instructions.  Thus,
the different "stages" working on the packet-vector need to be rather
small and compact.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
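To make the packet-vector idea a bit more concrete, here is a minimal,
self-contained userspace sketch (plain C, not driver code).  All the
names in it (struct pkt, stage_xdp, stage_stack, stage_tx_flush,
BUNDLE) are made up for illustration; the point is only the structure:
each small stage runs across the whole bundle before the next stage's
code is brought in, so its decoded instructions are reused N times
instead of being evicted per packet.

/*
 * Sketch of staged "packet-vector" processing.  Instead of running
 * stage A and stage B back-to-back per packet (so each stage's hot
 * code is pushed out while the other runs), each stage is applied
 * across the whole bundle before moving on to the next stage.
 *
 * All names (struct pkt, stage_*, BUNDLE) are illustrative only;
 * this is not mlx5 or any other real driver code.
 */
#include <stdio.h>

#define BUNDLE 64               /* packets pulled from the RX ring per poll */

struct pkt {
	unsigned char data[64];
	int action;             /* 0 = to stack, 1 = XDP_TX (toy encoding) */
};

/* Stage 1: the "XDP-like" stage, run over the whole vector. */
static void stage_xdp(struct pkt *v, int n)
{
	for (int i = 0; i < n; i++)
		v[i].action = v[i].data[0] & 1;   /* toy verdict */
}

/* Stage 2: hand all stack-bound packets over as one bundle. */
static void stage_stack(struct pkt *v, int n)
{
	int delivered = 0;

	for (int i = 0; i < n; i++)
		if (v[i].action == 0)
			delivered++;
	printf("delivered %d packets to the (pretend) stack\n", delivered);
}

/* Stage 3: flush all XDP_TX packets with a single (pretend) doorbell. */
static void stage_tx_flush(struct pkt *v, int n)
{
	int queued = 0;

	for (int i = 0; i < n; i++)
		if (v[i].action == 1)
			queued++;
	printf("xmit_more-style TX flush of %d packets\n", queued);
}

int main(void)
{
	static struct pkt vec[BUNDLE];   /* stands in for one RX bundle */

	/* Per poll: each small stage walks the whole vector, so its
	 * decoded instructions are reused BUNDLE times before the next
	 * stage's code displaces them. */
	stage_xdp(vec, BUNDLE);
	stage_stack(vec, BUNDLE);
	stage_tx_flush(vec, BUNDLE);
	return 0;
}

Each of those stage functions stays far below the ~500-instruction
guideline, and the single TX flush at the end maps naturally onto the
kind of xmit-more TX bulking discussed above.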
Notice: The same mental model switch applies to delivering packets to
the regular netstack.  I've brought this up before[3].  Instead of
flushing the driver's I-cache for every packet by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack.  I showed a 10% speedup with a naive implementation of this
approach.  Edward Cree also showed[4] a 10% performance boost, and by
going further into the stack, a 25% increase.  A goal is also to make
optimizing netstack code-size independent of the driver code-size, by
separating the netstack's I-cache usage from the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer