From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 2 Dec 2016 13:13:44 +0100
From: Jesper Dangaard Brouer
Subject: Re: Initial thoughts on TXDP
Message-ID: <20161202131344.12ce594c@redhat.com>
References: <20161201024407.GE26507@breakpoint.cc>
Sender: owner-linux-mm@kvack.org
To: Tom Herbert
Cc: brouer@redhat.com, Florian Westphal, Linux Kernel Network Developers, linux-mm

On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert wrote:

> On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote:
> > Tom Herbert wrote:
[...]
> >> - Call into TCP/IP stack with page data directly from driver-- no
> >> skbuff allocation or interface. This is essentially provided by the
> >> XDP API although we would need to generalize the interface to call
> >> stack functions (I previously posted patches for that). We will also
> >> need a new action, XDP_HELD?, that indicates the XDP function held the
> >> packet (put on a socket for instance).
> >
> > Seems this will not work at all with the planned page pool thing when
> > pages start to be held indefinitely.

It is quite the opposite: the page pool supports pages being held for
longer than drivers can handle today. The current driver page-recycle
tricks cannot, as they depend on the page refcnt being decremented
quickly (while pages are still mapped in their recycle queue).

> > You can also never get even close to userspace offload stacks once you
> > need/do this; allocations in hotpath are too expensive.

Yes. It is important to understand that once the number of outstanding
pages gets large, the driver recycle stops working, meaning that page
allocations start to go through the page allocator. I've documented[1]
that the bare alloc+free cost[2] (231 cycles order-0/4K) is higher than
the 10G wirespeed budget (201 cycles).

Thus, the driver recycle tricks are nice for benchmarking, as they hide
the page allocator overhead. But this optimization might disappear for
Tom's and Eric's more real-world use-cases, e.g. 10,000 sockets. The
page pool doesn't have these issues.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 2 Dec 2016 14:01:02 +0100
From: Jesper Dangaard Brouer
Subject: Re: Initial thoughts on TXDP
Message-ID: <20161202140102.1d515e0b@redhat.com>
In-Reply-To: <859a0c99-f427-1db8-d260-1297777792fb@stressinduktion.org>
References: <20161201024407.GE26507@breakpoint.cc> <859a0c99-f427-1db8-d260-1297777792fb@stressinduktion.org>
Sender: owner-linux-mm@kvack.org
To: Hannes Frederic Sowa
Cc: brouer@redhat.com, Tom Herbert, Florian Westphal, Linux Kernel Network Developers, Alexander Duyck, John Fastabend, linux-mm

On Thu, 1 Dec 2016 23:47:44 +0100 Hannes Frederic Sowa wrote:

> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> > 320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
the base performance/cost much (at least for small objects, if < 1024).

Do notice that the base SLUB alloc+free cost is fairly high (compared to
a 201 cycles budget), especially for networking, as the free side is
very likely to hit a slow path. The SLUB fast path costs 53 cycles, and
the slow path around 100 cycles (data from [1]). I've tried to address
this with the kmem_cache bulk APIs, which reduce the cost to approx 30
cycles. (Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511

> >> >
> > It's the zero'ing of three cache lines. I believe we talked about that
> > as netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes, i.e. 3 cache-lines.

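The cycle figures quoted in this thread can be put side by side with a little arithmetic. The following is only a rough sketch: the thread gives the 201-cycle budget without its derivation, so the 64-byte minimum frame, the 20 bytes of preamble plus inter-frame gap, and the 3 GHz clock are assumptions made here. The per-operation costs are the ones quoted in the thread, and the names are illustrative.

```python
# Sketch: per-packet cycle budget at 10 Gbit/s, and how the allocator
# costs quoted in the thread compare against it.
# ASSUMPTIONS (not stated in the thread): minimum-size 64-byte frames,
# 20 bytes of preamble + inter-frame gap per frame, and a 3 GHz clock.

LINK_BITS_PER_SEC = 10e9
WIRE_BYTES_PER_FRAME = 64 + 20   # assumed: min frame + preamble/IFG
CPU_HZ = 3e9                     # assumed clock rate

# Costs in cycles, as quoted in the thread.
COSTS = {
    "page alloc+free (order-0)": 231,
    "SLUB slow path": 100,
    "SLUB fast path": 53,
    "kmem_cache bulk API": 30,
}


def ns_per_packet(link_bps=LINK_BITS_PER_SEC, wire_bytes=WIRE_BYTES_PER_FRAME):
    """Time between back-to-back frames at line rate, in nanoseconds."""
    return wire_bytes * 8 / link_bps * 1e9


def cycle_budget(cpu_hz=CPU_HZ):
    """CPU cycles available per packet at line rate."""
    return ns_per_packet() * cpu_hz / 1e9


def budget_share(cost_cycles):
    """Fraction of the per-packet budget consumed by one operation."""
    return cost_cycles / cycle_budget()


if __name__ == "__main__":
    print(f"budget: {ns_per_packet():.1f} ns, {cycle_budget():.1f} cycles")
    for name, cycles in COSTS.items():
        print(f"{name:28s} {cycles:4d} cycles = {budget_share(cycles):.0%} of budget")
```

Under these assumptions the budget works out to 67.2 ns, or about 201.6 cycles per packet, which matches the 201 cycles used in the thread. A single order-0 page alloc+free then exceeds the whole budget, while the bulk API leaves most of it free for other work.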
>
> Jesper and me played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by gcc builtin but plain movq's. Probably the cost model
> for __builtin_memset in gcc is wrong?

Yes, I believe so.

> When Jesper is free we wanted to benchmark this and maybe come up with a
> arch specific way of cleaning if it turns out to really improve throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was that on my Skylake CPU (i7-6700K @4.00GHz),
Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.

Talking to Alex and John during netdev, and reading up on the Intel
arch, I thought that this CPU should be able to perform 16 bytes per
cycle. The CPU can do it, as rep-stos shows once the size gets large
enough.

On this CPU the memset rep stos starts to win around 512 bytes:

   192/35  =  5.5 bytes/cycle
   256/36  =  7.1 bytes/cycle
   512/40  = 12.8 bytes/cycle
   768/46  = 16.7 bytes/cycle
  1024/52  = 19.7 bytes/cycle
  2048/84  = 24.4 bytes/cycle
  4096/148 = 27.7 bytes/cycle

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
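
The bytes/cycle table in the message above, and the point where rep stos overtakes the MOVQ variant, can be recomputed from the quoted measurements. This is a minimal sketch: the (size, cycles) pairs are copied from the table, while extending the observed 8 bytes/cycle MOVQ ceiling linearly to every size is an assumption made here for comparison, not a measurement from the thread. Function names are illustrative.

```python
# Sketch: recompute the rep-stos throughput table and find the size
# at which rep stos beats a MOVQ loop capped at 8 bytes/cycle.

# (size in bytes, cycles) for "rep stos" memset, copied from the thread.
REP_STOS = [
    (192, 35), (256, 36), (512, 40), (768, 46),
    (1024, 52), (2048, 84), (4096, 148),
]

# Ceiling reported for the hand-optimized MOVQ code.
MOVQ_BYTES_PER_CYCLE = 8


def bytes_per_cycle(size, cycles):
    """Throughput of one measurement."""
    return size / cycles


def movq_cycles(size):
    """ASSUMPTION: MOVQ sustains exactly 8 bytes/cycle at every size."""
    return size / MOVQ_BYTES_PER_CYCLE


def first_size_where_rep_stos_wins():
    """Smallest measured size where rep stos needs fewer cycles than MOVQ."""
    for size, cycles in REP_STOS:
        if cycles < movq_cycles(size):
            return size
    return None


if __name__ == "__main__":
    for size, cycles in REP_STOS:
        print(f"{size:5d}/{cycles:<4d}= {bytes_per_cycle(size, cycles):5.1f} bytes/cycle"
              f"  (MOVQ estimate: {movq_cycles(size):.0f} cycles)")
    print("rep stos wins from:", first_size_where_rep_stos_wins(), "bytes")
```

With the MOVQ estimate at 32 cycles for 256 bytes against 36 measured for rep stos, and 64 cycles for 512 bytes against 40, the crossover falls between those two sizes, consistent with the "around 512 bytes" observation.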