From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-qk0-f173.google.com (mail-qk0-f173.google.com [209.85.220.173]) by kanga.kvack.org (Postfix) with ESMTP id 643C76B0038 for ; Sun, 23 Aug 2015 20:58:20 -0400 (EDT)
Received: by qkbm65 with SMTP id m65so59264523qkb.2 for ; Sun, 23 Aug 2015 17:58:20 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id v9si5958184qkv.29.2015.08.23.17.58.19 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 23 Aug 2015 17:58:19 -0700 (PDT)
Subject: [PATCH V2 0/3] slub: introducing detached freelist
From: Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:58:15 +0200
Message-ID: <20150824005727.2947.36065.stgit@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer

REPOST:
 * Only updated comment in patch01 per request of Christoph Lameter.
 * No other objections have been made
 * Prev post: http://thread.gmane.org/gmane.linux.kernel.mm/135704

A NEW use-case for this API is RCU-free (and it is still intended for network NICs).

Introducing what I call a detached freelist, for improving the performance of object freeing in the "slowpath" of kmem_cache_free_bulk, which calls __slab_free().

The benchmarking tools are available here:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
 See: slab_bulk_test0{1,2,3}.c

Compared against the existing bulk-API (in AKPM's tree), we see a small regression for small bulk sizes (between 2-5 cycles), but a huge improvement for the slowpath.
(negative "improved" values are regressions)

bulk - Bulk-API-before           - Bulk-API with patchset
   1 -  42 cycles(tsc) 10.520 ns -  47 cycles(tsc) 11.931 ns - improved -11.9%
   2 -  26 cycles(tsc)  6.697 ns -  29 cycles(tsc)  7.368 ns - improved -11.5%
   3 -  22 cycles(tsc)  5.589 ns -  24 cycles(tsc)  6.003 ns - improved  -9.1%
   4 -  19 cycles(tsc)  4.921 ns -  22 cycles(tsc)  5.543 ns - improved -15.8%
   8 -  17 cycles(tsc)  4.499 ns -  20 cycles(tsc)  5.047 ns - improved -17.6%
  16 -  69 cycles(tsc) 17.424 ns -  20 cycles(tsc)  5.015 ns - improved  71.0%
  30 -  88 cycles(tsc) 22.075 ns -  20 cycles(tsc)  5.062 ns - improved  77.3%
  32 -  83 cycles(tsc) 20.965 ns -  20 cycles(tsc)  5.089 ns - improved  75.9%
  34 -  80 cycles(tsc) 20.039 ns -  28 cycles(tsc)  7.006 ns - improved  65.0%
  48 -  76 cycles(tsc) 19.252 ns -  31 cycles(tsc)  7.755 ns - improved  59.2%
  64 -  86 cycles(tsc) 21.523 ns -  68 cycles(tsc) 17.203 ns - improved  20.9%
 128 -  97 cycles(tsc) 24.444 ns -  72 cycles(tsc) 18.195 ns - improved  25.8%
 158 -  96 cycles(tsc) 24.036 ns -  73 cycles(tsc) 18.372 ns - improved  24.0%
 250 - 100 cycles(tsc) 25.007 ns -  73 cycles(tsc) 18.430 ns - improved  27.0%

The patchset is based on top of commit aefbef10e3ae with the previously accepted bulk patchset (V2) applied (available in AKPM's quilt).

Small note: the benchmark was run with a kernel compiled with CONFIG_FTRACE in its .config, in order to use the perf probes to measure the amount of page bulking into __slab_free() while running the "worst-case" testing module slab_bulk_test03.c.

---

Jesper Dangaard Brouer (3):
      slub: extend slowpath __slab_free() to handle bulk free
      slub: optimize bulk slowpath free by detached freelist
      slub: build detached freelist with look-ahead

 mm/slub.c | 142 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 112 insertions(+), 30 deletions(-)

--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f179.google.com (mail-qk0-f179.google.com [209.85.220.179]) by kanga.kvack.org (Postfix) with ESMTP id D369A6B0038 for ; Sun, 23 Aug 2015 20:58:53 -0400 (EDT) Received: by qkbm65 with SMTP id m65so59269183qkb.2 for ; Sun, 23 Aug 2015 17:58:53 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id f138si25722969qka.28.2015.08.23.17.58.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 23 Aug 2015 17:58:53 -0700 (PDT) Subject: [PATCH V2 1/3] slub: extend slowpath __slab_free() to handle bulk free From: Jesper Dangaard Brouer Date: Mon, 24 Aug 2015 02:58:48 +0200 Message-ID: <20150824005823.2947.19259.stgit@localhost> In-Reply-To: <20150824005727.2947.36065.stgit@localhost> References: <20150824005727.2947.36065.stgit@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer Make it possible to free a freelist with several objects by extending __slab_free() with two arguments: a freelist_head pointer and objects counter (cnt). If freelist_head pointer is set, then the object must be the freelist tail pointer. This allows a freelist with several objects (all within the same slab-page) to be free'ed using a single locked cmpxchg_double. Micro benchmarking showed no performance reduction due to this change. 
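
The bulk form of the slowpath free described above can be sketched in userspace as follows. This is only a rough model under stated assumptions, not the kernel code: struct model_obj, struct model_page and model_chain_free() are hypothetical names, and a single-word C11 compare-and-swap stands in for cmpxchg_double_slab, so the in-use counter is updated separately instead of atomically together with the freelist head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a free object: it stores the next pointer itself. */
struct model_obj {
	struct model_obj *next;
};

/* Hypothetical stand-in for a slab page: freelist head plus in-use count. */
struct model_page {
	_Atomic(struct model_obj *) freelist;
	atomic_int inuse;
};

/*
 * Free a pre-linked chain (head ... tail) of cnt objects to one page.
 * The chain is private to the caller, so only publishing the new head
 * needs an atomic operation, however many objects the chain holds.
 */
static void model_chain_free(struct model_page *page, struct model_obj *head,
			     struct model_obj *tail, int cnt)
{
	struct model_obj *prior = atomic_load(&page->freelist);

	assert(head && tail && cnt > 0);

	do {
		/* Hook the tail onto whatever is currently on the freelist. */
		tail->next = prior;
		/* On failure, prior is reloaded and the loop retries. */
	} while (!atomic_compare_exchange_weak(&page->freelist, &prior, head));

	/* Simplification: the kernel folds this into the same cmpxchg_double. */
	atomic_fetch_sub(&page->inuse, cnt);
}
```

With cnt == 1 and head == tail this degenerates to the ordinary single-object free, which mirrors how the existing callers pass NULL and 1 in the patch.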
Signed-off-by: Jesper Dangaard Brouer --- V2: Per request of Christoph Lameter * Made it more clear that freelist objs must exist within same page mm/slub.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index c9305f525004..10b57a3bb895 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2573,9 +2573,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * So we still attempt to reduce cache line usage. Just take the slab * lock and free the item. If there is no additional partial page * handling required then we can return immediately. + * + * Bulk free of a freelist with several objects (all pointing to the + * same page) possible by specifying freelist_head ptr and object as + * tail ptr, plus objects count (cnt). */ static void __slab_free(struct kmem_cache *s, struct page *page, - void *x, unsigned long addr) + void *x, unsigned long addr, + void *freelist_head, int cnt) { void *prior; void **object = (void *)x; @@ -2584,6 +2589,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, unsigned long counters; struct kmem_cache_node *n = NULL; unsigned long uninitialized_var(flags); + void *new_freelist = (!freelist_head) ? 
object : freelist_head; stat(s, FREE_SLOWPATH); @@ -2601,7 +2607,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, set_freepointer(s, object, prior); new.counters = counters; was_frozen = new.frozen; - new.inuse--; + new.inuse -= cnt; if ((!new.inuse || !prior) && !was_frozen) { if (kmem_cache_has_cpu_partial(s) && !prior) { @@ -2632,7 +2638,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } while (!cmpxchg_double_slab(s, page, prior, counters, - object, new.counters, + new_freelist, new.counters, "__slab_free")); if (likely(!n)) { @@ -2736,7 +2742,7 @@ redo: } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, x, addr); + __slab_free(s, page, x, addr, NULL, 1); } @@ -2780,7 +2786,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) c->tid = next_tid(c->tid); local_irq_enable(); /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_); + __slab_free(s, page, object, _RET_IP_, NULL, 1); local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f175.google.com (mail-qk0-f175.google.com [209.85.220.175]) by kanga.kvack.org (Postfix) with ESMTP id 6B8716B0254 for ; Sun, 23 Aug 2015 20:59:08 -0400 (EDT) Received: by qkbm65 with SMTP id m65so59271194qkb.2 for ; Sun, 23 Aug 2015 17:59:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id a108si25759486qga.15.2015.08.23.17.59.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 23 Aug 2015 17:59:07 -0700 (PDT)
Subject: [PATCH V2 2/3] slub: optimize bulk slowpath free by detached freelist
From: Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:59:04 +0200
Message-ID: <20150824005857.2947.51229.stgit@localhost>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer

This change focuses on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk.

The slowpath call __slab_free() has been extended with support for bulk free, which amortizes the overhead of the locked cmpxchg_double_slab.

To use the new bulking feature of __slab_free(), we build what I call a detached freelist. The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same page, each with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached freelist can be constructed directly in the page objects, without any need for synchronization.

The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs.
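
To make the three properties concrete, here is a minimal userspace sketch of building detached freelists from an array of objects. It is an illustration under assumptions, not the patch itself: all model_* names are hypothetical, a page_id field stands in for virt_to_head_page(), the per-CPU fastpath is left out, and the flush callback represents the single synchronized step (the bulk __slab_free() call).

```c
#include <stddef.h>

struct model_obj {
	struct model_obj *next;	/* free pointer, stored inside the object */
	int page_id;		/* hypothetical stand-in for virt_to_head_page() */
};

/* Lives on the caller's stack, so no other CPU can see the head pointer. */
struct model_detached_freelist {
	int page_id;
	struct model_obj *freelist;	/* head of the private list */
	struct model_obj *tail_object;
	int cnt;
};

/* Hypothetical flush callback: the one synchronized step per list. */
typedef void (*model_flush_fn)(const struct model_detached_freelist *df,
			       void *ctx);

/* Example flush that only tallies how many objects were handed over. */
static void model_flush_count(const struct model_detached_freelist *df,
			      void *ctx)
{
	*(int *)ctx += df->cnt;
}

/*
 * Extend the detached freelist while consecutive objects share a page,
 * flush it when the page changes. Returns the number of flushes.
 */
static int model_free_bulk(struct model_obj **p, size_t size,
			   model_flush_fn flush, void *ctx)
{
	struct model_detached_freelist df = { .cnt = 0 };
	int flushes = 0;

	for (size_t i = 0; i < size; i++) {
		struct model_obj *object = p[i];

		if (df.cnt && object->page_id == df.page_id) {
			/* Plain stores suffice: the caller owns the object. */
			object->next = df.freelist;
			df.freelist = object;
			df.cnt++;
			continue;
		}
		if (df.cnt) {
			flush(&df, ctx);
			flushes++;
		}
		/* Start a new detached freelist with this object as tail. */
		object->next = NULL;
		df.page_id = object->page_id;
		df.freelist = object;
		df.tail_object = object;
		df.cnt = 1;
	}
	if (df.cnt) {
		flush(&df, ctx);
		flushes++;
	}
	return flushes;
}
```

For an array whose objects come from pages 1,1,1,2,2,1 this model performs three flushes instead of six single frees, which is where the amortization of the locked operation comes from.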

This implementation is fairly simple, as it only builds the detached freelist if two consecutive objects belong to the same page. When it detects that an object's page does not match, it simply flushes the local freelist and starts a new local detached freelist. It will not look ahead to see if further opportunities exist in the

The next patch has a more advanced look-ahead approach, but is also more complicated. I am splitting them up because I want to be able to benchmark the simple against the advanced approach.

Signed-off-by: Jesper Dangaard Brouer
---
bulk - Fallback                  - Bulk API
   1 -  64 cycles(tsc) 16.109 ns -  47 cycles(tsc) 11.894 ns - improved 26.6%
   2 -  56 cycles(tsc) 14.158 ns -  45 cycles(tsc) 11.274 ns - improved 19.6%
   3 -  54 cycles(tsc) 13.650 ns -  23 cycles(tsc)  6.001 ns - improved 57.4%
   4 -  53 cycles(tsc) 13.268 ns -  21 cycles(tsc)  5.262 ns - improved 60.4%
   8 -  51 cycles(tsc) 12.841 ns -  18 cycles(tsc)  4.718 ns - improved 64.7%
  16 -  50 cycles(tsc) 12.583 ns -  19 cycles(tsc)  4.896 ns - improved 62.0%
  30 -  85 cycles(tsc) 21.357 ns -  26 cycles(tsc)  6.549 ns - improved 69.4%
  32 -  82 cycles(tsc) 20.690 ns -  25 cycles(tsc)  6.412 ns - improved 69.5%
  34 -  81 cycles(tsc) 20.322 ns -  25 cycles(tsc)  6.365 ns - improved 69.1%
  48 -  93 cycles(tsc) 23.332 ns -  28 cycles(tsc)  7.139 ns - improved 69.9%
  64 -  98 cycles(tsc) 24.544 ns -  62 cycles(tsc) 15.543 ns - improved 36.7%
 128 -  96 cycles(tsc) 24.219 ns -  68 cycles(tsc) 17.143 ns - improved 29.2%
 158 - 107 cycles(tsc) 26.817 ns -  69 cycles(tsc) 17.431 ns - improved 35.5%
 250 - 107 cycles(tsc) 26.824 ns -  70 cycles(tsc) 17.730 ns - improved 34.6%
---
 mm/slub.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 10b57a3bb895..40e4b5926311 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2756,12 +2756,26 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+struct detached_freelist {
+	struct page *page;
+	void *freelist;
+	void *tail_object;
+	int cnt;
+};
+ /* Note that interrupts must be enabled when calling this function. */ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) { struct kmem_cache_cpu *c; struct page *page; int i; + /* Opportunistically delay updating page->freelist, hoping + * next free happen to same page. Start building the freelist + * in the page, but keep local stack ptr to freelist. If + * successful several object can be transferred to page with a + * single cmpxchg_double. + */ + struct detached_freelist df = {0}; local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); @@ -2778,22 +2792,42 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) page = virt_to_head_page(object); - if (c->page == page) { + if (page == df.page) { + /* Oppotunity to delay real free */ + set_freepointer(s, object, df.freelist); + df.freelist = object; + df.cnt++; + } else if (c->page == page) { /* Fastpath: local CPU free */ set_freepointer(s, object, c->freelist); c->freelist = object; } else { - c->tid = next_tid(c->tid); - local_irq_enable(); - /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_, NULL, 1); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Slowpath: Flush delayed free */ + if (df.page) { + c->tid = next_tid(c->tid); + local_irq_enable(); + __slab_free(s, df.page, df.tail_object, + _RET_IP_, df.freelist, df.cnt); + local_irq_disable(); + c = this_cpu_ptr(s->cpu_slab); + } + /* Start new round of delayed free */ + df.page = page; + df.tail_object = object; + set_freepointer(s, object, NULL); + df.freelist = object; + df.cnt = 1; } } exit: c->tid = next_tid(c->tid); local_irq_enable(); + + /* Flush detached freelist */ + if (df.page) { + __slab_free(s, df.page, df.tail_object, + _RET_IP_, df.freelist, df.cnt); + } } EXPORT_SYMBOL(kmem_cache_free_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-qk0-f171.google.com (mail-qk0-f171.google.com [209.85.220.171]) by kanga.kvack.org (Postfix) with ESMTP id D484F6B0255 for ; Sun, 23 Aug 2015 20:59:31 -0400 (EDT)
Received: by qkfh127 with SMTP id h127so59383777qkf.1 for ; Sun, 23 Aug 2015 17:59:31 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id g6si25688840qkh.105.2015.08.23.17.59.30 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 23 Aug 2015 17:59:31 -0700 (PDT)
Subject: [PATCH V2 3/3] slub: build detached freelist with look-ahead
From: Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:59:27 +0200
Message-ID: <20150824005911.2947.50857.stgit@localhost>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer

This change is a more advanced use of the detached freelist. The bulk free array is scanned in a progressive manner with a limited look-ahead facility.

To maintain the same performance level as the previous simple implementation, the look-ahead has been limited to only 3 objects. This number has been determined by experimental micro benchmarking.

For performance, the free loop in kmem_cache_free_bulk has been significantly reorganized, with a focus on making the branches more predictable for the compiler. E.g. the per CPU c->freelist is also built as a detached freelist, even though it would be just as fast to free directly to it, because it saves creating an unpredictable branch.
Another benefit of this change is that kmem_cache_free_bulk() runs mostly with IRQs enabled. The local IRQs are only disabled when updating the per CPU c->freelist. This should please Thomas Gleixner. Pitfall(1): Removed kmem debug support. Pitfall(2): No BUG_ON() freeing NULL pointers, but the algorithm handles and skips these NULL pointers. Compare against previous patch: There is some fluctuation in the benchmarks between runs. To counter this I've run some specific[1] bulk sizes, repeated 100 times and run dmesg through Rusty's "stats"[2] tool. Command line: sudo dmesg -c ;\ for x in `seq 100`; do \ modprobe slab_bulk_test02 bulksz=48 loops=100000 && rmmod slab_bulk_test02; \ echo $x; \ sleep 0.${RANDOM} ;\ done; \ dmesg | stats Results: bulk size:16, average: +2.01 cycles Prev: between 19-52 (average: 22.65 stddev:+/-6.9) This: between 19-67 (average: 24.67 stddev:+/-9.9) bulk size:48, average: +1.54 cycles Prev: between 23-45 (average: 27.88 stddev:+/-4) This: between 24-41 (average: 29.42 stddev:+/-3.7) bulk size:144, average: +1.73 cycles Prev: between 44-76 (average: 60.31 stddev:+/-7.7) This: between 49-80 (average: 62.04 stddev:+/-7.3) bulk size:512, average: +8.94 cycles Prev: between 50-68 (average: 60.11 stddev: +/-4.3) This: between 56-80 (average: 69.05 stddev: +/-5.2) bulk size:2048, average: +26.81 cycles Prev: between 61-73 (average: 68.10 stddev:+/-2.9) This: between 90-104(average: 94.91 stddev:+/-2.1) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test02.c [2] https://github.com/rustyrussell/stats Signed-off-by: Jesper Dangaard Brouer --- bulk- Fallback - Bulk API 1 - 64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6% 2 - 57 cycles(tsc) 14.397 ns - 29 cycles(tsc) 7.368 - improved 49.1% 3 - 55 cycles(tsc) 13.797 ns - 24 cycles(tsc) 6.003 - improved 56.4% 4 - 53 cycles(tsc) 13.500 ns - 22 cycles(tsc) 5.543 - improved 58.5% 8 - 52 cycles(tsc) 13.008 ns - 20 cycles(tsc) 5.047 - improved 61.5% 16 
- 51 cycles(tsc) 12.763 ns - 20 cycles(tsc) 5.015 - improved 60.8% 30 - 50 cycles(tsc) 12.743 ns - 20 cycles(tsc) 5.062 - improved 60.0% 32 - 51 cycles(tsc) 12.908 ns - 20 cycles(tsc) 5.089 - improved 60.8% 34 - 87 cycles(tsc) 21.936 ns - 28 cycles(tsc) 7.006 - improved 67.8% 48 - 79 cycles(tsc) 19.840 ns - 31 cycles(tsc) 7.755 - improved 60.8% 64 - 86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9% 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7% 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8% 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6% --- mm/slub.c | 138 ++++++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 90 insertions(+), 48 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 40e4b5926311..49ae96f45670 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2763,71 +2763,113 @@ struct detached_freelist { int cnt; }; -/* Note that interrupts must be enabled when calling this function. */ -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +/* + * This function extract objects belonging to the same page, and + * builds a detached freelist directly within the given page/objects. + * This can happen without any need for synchronization, because the + * objects are owned by running process. The freelist is build up as + * a single linked list in the objects. The idea is, that this + * detached freelist can then be bulk transferred to the real + * freelist(s), but only requiring a single synchronization primitive. + */ +static inline int build_detached_freelist( + struct kmem_cache *s, size_t size, void **p, + struct detached_freelist *df, int start_index) { - struct kmem_cache_cpu *c; struct page *page; int i; - /* Opportunistically delay updating page->freelist, hoping - * next free happen to same page. Start building the freelist - * in the page, but keep local stack ptr to freelist. 
If - * successful several object can be transferred to page with a - * single cmpxchg_double. - */ - struct detached_freelist df = {0}; + int lookahead = 0; + void *object; - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Always re-init detached_freelist */ + do { + object = p[start_index]; + if (object) { + /* Start new delayed freelist */ + df->page = virt_to_head_page(object); + df->tail_object = object; + set_freepointer(s, object, NULL); + df->freelist = object; + df->cnt = 1; + p[start_index] = NULL; /* mark object processed */ + } else { + df->page = NULL; /* Handle NULL ptr in array */ + } + start_index++; + } while (!object && start_index < size); - for (i = 0; i < size; i++) { - void *object = p[i]; + for (i = start_index; i < size; i++) { + object = p[i]; - BUG_ON(!object); - /* kmem cache debug support */ - s = cache_from_obj(s, object); - if (unlikely(!s)) - goto exit; - slab_free_hook(s, object); + if (!object) + continue; /* Skip processed objects */ page = virt_to_head_page(object); - if (page == df.page) { - /* Oppotunity to delay real free */ - set_freepointer(s, object, df.freelist); - df.freelist = object; - df.cnt++; - } else if (c->page == page) { - /* Fastpath: local CPU free */ - set_freepointer(s, object, c->freelist); - c->freelist = object; + /* df->page is always set at this point */ + if (page == df->page) { + /* Oppotunity build freelist */ + set_freepointer(s, object, df->freelist); + df->freelist = object; + df->cnt++; + p[i] = NULL; /* mark object processed */ + if (!lookahead) + start_index++; } else { - /* Slowpath: Flush delayed free */ - if (df.page) { + /* Limit look ahead search */ + if (++lookahead >= 3) + return start_index; + continue; + } + } + return start_index; +} + +/* Note that interrupts must be enabled when calling this function. 
*/ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + struct kmem_cache_cpu *c; + int iterator = 0; + struct detached_freelist df; + + BUG_ON(!size); + + /* Per CPU ptr may change afterwards */ + c = this_cpu_ptr(s->cpu_slab); + + while (likely(iterator < size)) { + iterator = build_detached_freelist(s, size, p, &df, iterator); + if (likely(df.page)) { + redo: + if (c->page == df.page) { + /* + * Local CPU free require disabling + * IRQs. It is possible to miss the + * oppotunity and instead free to + * page->freelist, but it does not + * matter as page->freelist will + * eventually be transferred to + * c->freelist + */ + local_irq_disable(); + c = this_cpu_ptr(s->cpu_slab); /* reload */ + if (c->page != df.page) { + local_irq_enable(); + goto redo; + } + /* Bulk transfer to CPU c->freelist */ + set_freepointer(s, df.tail_object, c->freelist); + c->freelist = df.freelist; + c->tid = next_tid(c->tid); local_irq_enable(); + } else { + /* Bulk transfer to page->freelist */ __slab_free(s, df.page, df.tail_object, _RET_IP_, df.freelist, df.cnt); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); } - /* Start new round of delayed free */ - df.page = page; - df.tail_object = object; - set_freepointer(s, object, NULL); - df.freelist = object; - df.cnt = 1; } } -exit: - c->tid = next_tid(c->tid); - local_irq_enable(); - - /* Flush detached freelist */ - if (df.page) { - __slab_free(s, df.page, df.tail_object, - _RET_IP_, df.freelist, df.cnt); - } } EXPORT_SYMBOL(kmem_cache_free_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id C852B6B0254 for ; Fri, 4 Sep 2015 13:00:38 -0400 (EDT)
Received: by padhy16 with SMTP id hy16so27250287pad.1 for ; Fri, 04 Sep 2015 10:00:38 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id ol6si5238119pab.37.2015.09.04.10.00.37 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 10:00:37 -0700 (PDT)
Subject: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
From: Jesper Dangaard Brouer
Date: Fri, 04 Sep 2015 19:00:34 +0200
Message-ID: <20150904165944.4312.32435.stgit@devil>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: netdev@vger.kernel.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com

During TX DMA completion cleanup there exists an opportunity in the NIC drivers to perform bulk free, without introducing additional latency.

For an IPv4 forwarding workload the network stack is hitting the slowpath of the kmem_cache "slub" allocator. This slowpath can be mitigated by bulk free via the detached freelists patchset.

Depend on patchset: http://thread.gmane.org/gmane.linux.kernel.mm/137469 Kernel based on MMOTM tag 2015-08-24-16-12 from git repo: git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation" Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen): * Before: 2043575 pps * After : 2090522 pps * Improvements: +46947 pps and -10.99 ns In the before case, perf report shows slub free hits the slowpath: 1.98% ksoftirqd/6 [kernel.vmlinux] [k] __slab_free.isra.72 1.29% ksoftirqd/6 [kernel.vmlinux] [k] cmpxchg_double_slab.isra.71 0.95% ksoftirqd/6 [kernel.vmlinux] [k] kmem_cache_free 0.95% ksoftirqd/6 [kernel.vmlinux] [k] kmem_cache_alloc 0.20% ksoftirqd/6 [kernel.vmlinux] [k] __cmpxchg_double_slab.isra.60 0.17% ksoftirqd/6 [kernel.vmlinux] [k] ___slab_alloc.isra.68 0.09% ksoftirqd/6 [kernel.vmlinux] [k] __slab_alloc.isra.69 After the slowpath calls are almost gone: 0.22% ksoftirqd/6 [kernel.vmlinux] [k] __cmpxchg_double_slab.isra.60 0.18% ksoftirqd/6 [kernel.vmlinux] [k] ___slab_alloc.isra.68 0.14% ksoftirqd/6 [kernel.vmlinux] [k] __slab_free.isra.72 0.14% ksoftirqd/6 [kernel.vmlinux] [k] cmpxchg_double_slab.isra.71 0.08% ksoftirqd/6 [kernel.vmlinux] [k] __slab_alloc.isra.69 Extra info, tuning SLUB per CPU structures gives further improvements: * slub-tuned: 2124217 pps * patched increase: +33695 pps and -7.59 ns * before increase: +80642 pps and -18.58 ns Tuning done: echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial echo 9 > /sys/kernel/slab/skbuff_head_cache/min_partial Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge": * slab_nomerge: 2121824 pps Test notes: * Notice very fast CPU i7-4790K CPU @ 4.00GHz * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh) * Tuned for forwarding: - unloaded netfilter modules - Sysctl settings: - 
net/ipv4/conf/default/rp_filter = 0 - net/ipv4/conf/all/rp_filter = 0 - (Forwarding performance is affected by early demux) - net/ipv4/ip_early_demux = 0 - net.ipv4.ip_forward = 1 - Disabled GRO on NICs - ethtool -K ixgbe3 gro off tso off gso off --- Jesper Dangaard Brouer (3): net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() net: NIC helper API for building array of skbs to free ixgbe: bulk free SKBs during TX completion cleanup cycle drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 13 +++- include/linux/netdevice.h | 62 ++++++++++++++++++ include/linux/skbuff.h | 1 net/core/skbuff.c | 87 ++++++++++++++++++++----- 4 files changed, 144 insertions(+), 19 deletions(-) -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id B5C0A6B0255 for ; Fri, 4 Sep 2015 13:00:56 -0400 (EDT) Received: by pacwi10 with SMTP id wi10so29486322pac.3 for ; Fri, 04 Sep 2015 10:00:56 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id il4si5188625pbb.177.2015.09.04.10.00.55 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 10:00:55 -0700 (PDT)
Subject: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
From: Jesper Dangaard Brouer
Date: Fri, 04 Sep 2015 19:00:53 +0200
Message-ID: <20150904170046.4312.38018.stgit@devil>
In-Reply-To: <20150904165944.4312.32435.stgit@devil>
References: <20150904165944.4312.32435.stgit@devil>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: netdev@vger.kernel.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com

Introduce the first user of the SLAB bulk free API kmem_cache_free_bulk() in the network stack, in the form of the function kfree_skb_bulk(), which bulk frees SKBs (not skb clones or skb->head, yet).

As this is the third user of SKB reference decrementing, split out the refcnt decrement into a helper function and use it at all call points.
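
The refcount helper that is split out can be modelled in userspace roughly like this. It is a sketch under assumptions rather than the kernel helper: struct model_buf and model_dec_and_test() are hypothetical names, and a C11 acquire fence only approximates the kernel's smp_rmb().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the reference count. */
struct model_buf {
	atomic_int users;
};

/* Drop one reference; return true if the caller may now free the buffer. */
static bool model_dec_and_test(struct model_buf *buf)
{
	if (!buf)
		return false;

	if (atomic_load_explicit(&buf->users, memory_order_relaxed) == 1) {
		/* Sole owner: skip the atomic decrement, just order later reads. */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}

	/* Shared: drop our reference; true only if it was the last one. */
	return atomic_fetch_sub(&buf->users, 1) == 1;
}
```

Note that on the sole-owner path the count is left at 1, which is why the comments in the patch say the buffer is ready to free when the usage count "has hit zero or is equal to 1".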
Signed-off-by: Jesper Dangaard Brouer --- include/linux/skbuff.h | 1 + net/core/skbuff.c | 87 +++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 71 insertions(+), 17 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b97597970ce7..e5f1e007723b 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb) } void kfree_skb(struct sk_buff *skb); +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size); void kfree_skb_list(struct sk_buff *segs); void skb_tx_error(struct sk_buff *skb); void consume_skb(struct sk_buff *skb); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 429b407b4fe6..034545934158 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb) } EXPORT_SYMBOL(__kfree_skb); +/* + * skb_dec_and_test - Helper to drop ref to SKB and see is ready to free + * @skb: buffer to decrement reference + * + * Drop a reference to the buffer, and return true if it is ready + * to free. Which is if the usage count has hit zero or is equal to 1. + * + * This is performance critical code that should be inlined. + */ +static inline bool skb_dec_and_test(struct sk_buff *skb) +{ + if (unlikely(!skb)) + return false; + if (likely(atomic_read(&skb->users) == 1)) + smp_rmb(); + else if (likely(!atomic_dec_and_test(&skb->users))) + return false; + /* If reaching here SKB is ready to free */ + return true; +} + /** * kfree_skb - free an sk_buff * @skb: buffer to free * * Drop a reference to the buffer and free it if the usage count has - * hit zero. + * hit zero or is equal to 1. 
*/ void kfree_skb(struct sk_buff *skb) { - if (unlikely(!skb)) - return; - if (likely(atomic_read(&skb->users) == 1)) - smp_rmb(); - else if (likely(!atomic_dec_and_test(&skb->users))) - return; - trace_kfree_skb(skb, __builtin_return_address(0)); - __kfree_skb(skb); + if (skb_dec_and_test(skb)) { + trace_kfree_skb(skb, __builtin_return_address(0)); + __kfree_skb(skb); + } } EXPORT_SYMBOL(kfree_skb); +/** + * kfree_skb_bulk - bulk free SKBs when refcnt allows to + * @skbs: array of SKBs to free + * @size: number of SKBs in array + * + * If SKB refcnt allows for free, then release any auxiliary data + * and then bulk free SKBs to the SLAB allocator. + * + * Note that interrupts must be enabled when calling this function. + */ +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size) +{ + int i; + size_t cnt = 0; + + for (i = 0; i < size; i++) { + struct sk_buff *skb = skbs[i]; + + if (!skb_dec_and_test(skb)) + continue; /* skip skb, not ready to free */ + + /* Construct an array of SKBs, ready to be free'ed and + * cleanup all auxiliary, before bulk free to SLAB. 
+ * For now, only handle non-cloned SKBs, related to + * SLAB skbuff_head_cache + */ + if (skb->fclone == SKB_FCLONE_UNAVAILABLE) { + skb_release_all(skb); + skbs[cnt++] = skb; + } else { + /* SKB was a clone, don't handle this case */ + __kfree_skb(skb); + } + } + if (likely(cnt)) { + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); + } +} +EXPORT_SYMBOL(kfree_skb_bulk); + void kfree_skb_list(struct sk_buff *segs) { while (segs) { @@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error); */ void consume_skb(struct sk_buff *skb) { - if (unlikely(!skb)) - return; - if (likely(atomic_read(&skb->users) == 1)) - smp_rmb(); - else if (likely(!atomic_dec_and_test(&skb->users))) - return; - trace_consume_skb(skb); - __kfree_skb(skb); + if (skb_dec_and_test(skb)) { + trace_consume_skb(skb); + __kfree_skb(skb); + } } EXPORT_SYMBOL(consume_skb); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f174.google.com (mail-io0-f174.google.com [209.85.223.174]) by kanga.kvack.org (Postfix) with ESMTP id A66036B0256 for ; Fri, 4 Sep 2015 13:01:09 -0400 (EDT) Received: by iofb144 with SMTP id b144so30846311iof.1 for ; Fri, 04 Sep 2015 10:01:09 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id na7si456183pdb.93.2015.09.04.10.01.08 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 10:01:09 -0700 (PDT) Subject: [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free From: Jesper Dangaard Brouer Date: Fri, 04 Sep 2015 19:01:06 +0200 Message-ID: <20150904170104.4312.47707.stgit@devil> In-Reply-To: <20150904165944.4312.32435.stgit@devil> References: <20150904165944.4312.32435.stgit@devil> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: netdev@vger.kernel.org, akpm@linux-foundation.org Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com NIC device drivers are expected to use this small helper API to build up an array of objects/skbs to bulk free while looping over the objects to free. Objects to be freed later are added (dev_free_waitlist_add) to an array, which is flushed if it runs full. After processing, the array is flushed (dev_free_waitlist_flush). The array should be stored on the local stack. Usage: e.g. during the TX completion loop, the NIC driver can replace dev_consume_skb_any() with an "add", and after the loop do a "flush". For performance reasons the compiler should inline most of these functions.
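The add-and-flush pattern described above can be sketched in plain userspace C. This is only a model under stated assumptions, not the kernel code: struct obj stands in for struct sk_buff, bulk_free() stands in for kfree_skb_bulk() and merely counts what it receives, and the names waitlist_add(), waitlist_flush() and run_completion_loop() are hypothetical. Unlike the helper in the patch, this sketch also resets the counter in the flush function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff. */
struct obj { int id; };

static unsigned int freed_total;  /* objects handed to bulk_free() */
static unsigned int bulk_calls;   /* number of bulk_free() invocations */

/* Hypothetical stand-in for kfree_skb_bulk(): only does accounting. */
static void bulk_free(struct obj **objs, unsigned int cnt)
{
	(void)objs;
	freed_total += cnt;
	bulk_calls++;
}

struct waitlist {
	struct obj **objs;  /* caller-provided array, e.g. on the stack */
	unsigned int cnt;   /* number of objects waiting to be freed */
};

/* Queue one object; flush and reset as soon as the array is full. */
static inline void waitlist_add(struct waitlist *wl, struct obj *o,
				unsigned int max)
{
	wl->objs[wl->cnt++] = o;
	if (wl->cnt == max) {
		bulk_free(wl->objs, wl->cnt);
		wl->cnt = 0;
	}
}

/* Flush leftovers; never call bulk_free() with zero objects. */
static inline void waitlist_flush(struct waitlist *wl)
{
	if (wl->cnt) {
		bulk_free(wl->objs, wl->cnt);
		wl->cnt = 0;
	}
}

/* Model of a completion loop: "add" inside the loop, "flush" after it.
 * Returns how many bulk_free() calls were needed for nr_objs objects.
 */
static unsigned int run_completion_loop(unsigned int nr_objs)
{
	enum { BULK = 32, MAX_OBJS = 100 };
	struct obj pool[MAX_OBJS];
	struct obj *pending[BULK];
	struct waitlist wl = { .objs = pending, .cnt = 0 };
	unsigned int i;

	assert(nr_objs <= MAX_OBJS);
	freed_total = 0;
	bulk_calls = 0;

	for (i = 0; i < nr_objs; i++) {
		pool[i].id = (int)i;
		waitlist_add(&wl, &pool[i], BULK);
	}
	waitlist_flush(&wl);
	return bulk_calls;
}
```

With an array of 32 entries, a loop over 70 objects in this model ends up with two full-array flushes inside the loop and one final flush for the remaining 6 objects.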
Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 62 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 05b9a694e213..d0133e778314 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2935,6 +2935,68 @@ static inline void dev_consume_skb_any(struct sk_buff *skb) __dev_kfree_skb_any(skb, SKB_REASON_CONSUMED); } +/* The NIC device drivers are expected to use this small helper API, + * when building up an array of objects/skbs to bulk free, while + * (loop) processing objects to free. Objects to be free'ed later is + * added (dev_free_waitlist_add) to an array and flushed if the array + * runs full. After processing the array is flushed (dev_free_waitlist_flush). + * The array should be stored on the local stack. + * + * Usage e.g. during TX completion loop the NIC driver can replace + * dev_consume_skb_any() with an "add" and after the loop a "flush". + * + * For performance reasons the compiler should inline most of these + * functions. + */ +struct dev_free_waitlist { + struct sk_buff **skbs; + unsigned int skb_cnt; +}; + +static void __dev_free_waitlist_bulkfree(struct dev_free_waitlist *wl) +{ + /* Cannot bulk free from interrupt context or with IRQs + * disabled, due to how SLAB bulk API works (and gain it's + * speedup). This can e.g. happen due to invocation from + * netconsole/netpoll. + */ + if (unlikely(in_irq() || irqs_disabled())) { + int i; + + for (i = 0; i < wl->skb_cnt; i++) + dev_consume_skb_irq(wl->skbs[i]); + } else { + /* Likely fastpath, don't call with cnt == 0 */ + kfree_skb_bulk(wl->skbs, wl->skb_cnt); + } +} + +static inline void dev_free_waitlist_flush(struct dev_free_waitlist *wl) +{ + /* Flush the waitlist, but only if any objects remain, as bulk + * freeing "zero" objects is not supported and plus it avoids + * pointless function calls. 
+ */ + if (likely(wl->skb_cnt)) + __dev_free_waitlist_bulkfree(wl); +} + +static __always_inline void dev_free_waitlist_add(struct dev_free_waitlist *wl, + struct sk_buff *skb, + unsigned int max) +{ + /* It is recommended that max is a builtin constant, as this + * saves one register when inlined. Catch offenders with: + * BUILD_BUG_ON(!__builtin_constant_p(max)); + */ + wl->skbs[wl->skb_cnt++] = skb; + if (wl->skb_cnt == max) { + /* Detect when waitlist array is full, then flush and reset */ + __dev_free_waitlist_bulkfree(wl); + wl->skb_cnt = 0; + } +} + int netif_rx(struct sk_buff *skb); int netif_rx_ni(struct sk_buff *skb); int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id B0EF26B0257 for ; Fri, 4 Sep 2015 13:01:23 -0400 (EDT) Received: by pacex6 with SMTP id ex6so29528281pac.0 for ; Fri, 04 Sep 2015 10:01:23 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id ml3si5210970pab.134.2015.09.04.10.01.22 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 10:01:23 -0700 (PDT) Subject: [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle From: Jesper Dangaard Brouer Date: Fri, 04 Sep 2015 19:01:21 +0200 Message-ID: <20150904170117.4312.97676.stgit@devil> In-Reply-To: <20150904165944.4312.32435.stgit@devil> References: <20150904165944.4312.32435.stgit@devil> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: netdev@vger.kernel.org, akpm@linux-foundation.org Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com First user of the SKB bulk free API (namely kfree_skb_bulk(), via the waitlist helper add-and-flush API). There is an opportunity to bulk free SKBs while reclaiming resources after DMA transmit completes in ixgbe_clean_tx_irq. Thus, bulk freeing at this point does not introduce any added latency. A bulk size of 32 is chosen, even though the budget is usually 64, (1) to limit the stack usage and (2) because the SLAB cache behind SKBs has 32 objects per slab.
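To put rough numbers on the bulk-size trade-off above, here is a small userspace sketch (not part of the patch). The helper names waitlist_stack_bytes() and flushes_for() are hypothetical, struct sk_buff_stub is an opaque stand-in for struct sk_buff, and the flush count assumes every completed descriptor carries exactly one skb that is put on the waitlist.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque, hypothetical stand-in for struct sk_buff; only pointers are used. */
struct sk_buff_stub;

/* Stack bytes taken by an on-stack array of bulk_size skb pointers. */
static size_t waitlist_stack_bytes(size_t bulk_size)
{
	return bulk_size * sizeof(struct sk_buff_stub *);
}

/* Number of bulk-free calls for one cleanup run: one per full array,
 * plus one final flush if any objects remain.
 */
static unsigned int flushes_for(unsigned int completed, unsigned int bulk_size)
{
	return completed / bulk_size + (completed % bulk_size ? 1 : 0);
}
```

On a build with 8-byte pointers, an array of 32 entries takes 256 bytes of stack, while sizing it to the full budget of 64 would take 512 bytes. The price of the smaller array in this model is one extra bulk-free call when a full budget of 64 skbs completes in a single run.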
Signed-off-by: Jesper Dangaard Brouer --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 463ff47200f1..d35d6b47bae2 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1075,6 +1075,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter) * @q_vector: structure containing interrupt and ring information * @tx_ring: tx ring to clean **/ +#define BULK_FREE_SIZE 32 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, struct ixgbe_ring *tx_ring) { @@ -1084,6 +1085,11 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, unsigned int total_bytes = 0, total_packets = 0; unsigned int budget = q_vector->tx.work_limit; unsigned int i = tx_ring->next_to_clean; + struct sk_buff *skbs[BULK_FREE_SIZE]; + struct dev_free_waitlist wl; + + wl.skb_cnt = 0; + wl.skbs = skbs; if (test_bit(__IXGBE_DOWN, &adapter->state)) return true; @@ -1113,8 +1119,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, total_bytes += tx_buffer->bytecount; total_packets += tx_buffer->gso_segs; - /* free the skb */ - dev_consume_skb_any(tx_buffer->skb); + /* delay skb free and bulk free later */ + dev_free_waitlist_add(&wl, tx_buffer->skb, BULK_FREE_SIZE); /* unmap skb header data */ dma_unmap_single(tx_ring->dev, @@ -1164,6 +1170,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, budget--; } while (likely(budget)); + dev_free_waitlist_flush(&wl); /* free remaining SKBs on waitlist */ + i += tx_ring->count; tx_ring->next_to_clean = i; u64_stats_update_begin(&tx_ring->syncp); @@ -1224,6 +1232,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, return !!budget; } +#undef BULK_FREE_SIZE #ifdef CONFIG_IXGBE_DCA static void ixgbe_update_tx_dca(struct ixgbe_adapter *adapter, -- To 
unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id 3AEBC6B0038 for ; Fri, 4 Sep 2015 14:09:24 -0400 (EDT) Received: by pacex6 with SMTP id ex6so31063313pac.0 for ; Fri, 04 Sep 2015 11:09:23 -0700 (PDT) Received: from mail-pa0-x233.google.com (mail-pa0-x233.google.com. [2607:f8b0:400e:c03::233]) by mx.google.com with ESMTPS id ko10si5405926pbc.208.2015.09.04.11.09.23 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 11:09:23 -0700 (PDT) Received: by pacfv12 with SMTP id fv12so31577283pac.2 for ; Fri, 04 Sep 2015 11:09:23 -0700 (PDT) Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> From: Alexander Duyck Message-ID: <55E9DE51.7090109@gmail.com> Date: Fri, 4 Sep 2015 11:09:21 -0700 MIME-Version: 1.0 In-Reply-To: <20150904165944.4312.32435.stgit@devil> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer , netdev@vger.kernel.org, akpm@linux-foundation.org Cc: linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com On 09/04/2015 10:00 AM, Jesper Dangaard Brouer wrote: > During TX DMA completion cleanup there exist an opportunity in the NIC > drivers to perform bulk free, without introducing additional latency. > > For an IPv4 forwarding workload the network stack is hitting the > slowpath of the kmem_cache "slub" allocator. This slowpath can be > mitigated by bulk free via the detached freelists patchset. 
> > Depend on patchset: > http://thread.gmane.org/gmane.linux.kernel.mm/137469 > > Kernel based on MMOTM tag 2015-08-24-16-12 from git repo: > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git > Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation" > > > Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen): > * Before: 2043575 pps > * After : 2090522 pps > * Improvements: +46947 pps and -10.99 ns > > In the before case, perf report shows slub free hits the slowpath: > 1.98% ksoftirqd/6 [kernel.vmlinux] [k] __slab_free.isra.72 > 1.29% ksoftirqd/6 [kernel.vmlinux] [k] cmpxchg_double_slab.isra.71 > 0.95% ksoftirqd/6 [kernel.vmlinux] [k] kmem_cache_free > 0.95% ksoftirqd/6 [kernel.vmlinux] [k] kmem_cache_alloc > 0.20% ksoftirqd/6 [kernel.vmlinux] [k] __cmpxchg_double_slab.isra.60 > 0.17% ksoftirqd/6 [kernel.vmlinux] [k] ___slab_alloc.isra.68 > 0.09% ksoftirqd/6 [kernel.vmlinux] [k] __slab_alloc.isra.69 > > After the slowpath calls are almost gone: > 0.22% ksoftirqd/6 [kernel.vmlinux] [k] __cmpxchg_double_slab.isra.60 > 0.18% ksoftirqd/6 [kernel.vmlinux] [k] ___slab_alloc.isra.68 > 0.14% ksoftirqd/6 [kernel.vmlinux] [k] __slab_free.isra.72 > 0.14% ksoftirqd/6 [kernel.vmlinux] [k] cmpxchg_double_slab.isra.71 > 0.08% ksoftirqd/6 [kernel.vmlinux] [k] __slab_alloc.isra.69 > > > Extra info, tuning SLUB per CPU structures gives further improvements: > * slub-tuned: 2124217 pps > * patched increase: +33695 pps and -7.59 ns > * before increase: +80642 pps and -18.58 ns > > Tuning done: > echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial > echo 9 > /sys/kernel/slab/skbuff_head_cache/min_partial > > Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge": > * slab_nomerge: 2121824 pps > > Test notes: > * Notice very fast CPU i7-4790K CPU @ 4.00GHz > * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) > * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP > * Generator pktgen UDP single flow 
(pktgen_sample03_burst_single_flow.sh) > * Tuned for forwarding: > - unloaded netfilter modules > - Sysctl settings: > - net/ipv4/conf/default/rp_filter = 0 > - net/ipv4/conf/all/rp_filter = 0 > - (Forwarding performance is affected by early demux) > - net/ipv4/ip_early_demux = 0 > - net.ipv4.ip_forward = 1 > - Disabled GRO on NICs > - ethtool -K ixgbe3 gro off tso off gso off > > --- This is an interesting start. However I feel like it might work better if you were to create a per-cpu pool for skbs that could be freed and allocated in NAPI context. So for example we already have napi_alloc_skb, why not just add a napi_free_skb and then make the array of objects to be freed part of a pool that could be used for either allocation or freeing? If the pool runs empty you just allocate something like 8 or 16 new skb heads, and if you fill it you just free half of the list? - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f174.google.com (mail-io0-f174.google.com [209.85.223.174]) by kanga.kvack.org (Postfix) with ESMTP id 85C346B0256 for ; Fri, 4 Sep 2015 14:47:18 -0400 (EDT) Received: by iofb144 with SMTP id b144so34029196iof.1 for ; Fri, 04 Sep 2015 11:47:18 -0700 (PDT) Received: from mail-io0-f178.google.com (mail-io0-f178.google.com. 
[209.85.223.178]) by mx.google.com with ESMTPS id b2si3322207igb.24.2015.09.04.11.47.17 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 11:47:17 -0700 (PDT) Received: by iofh134 with SMTP id h134so33964531iof.0 for ; Fri, 04 Sep 2015 11:47:17 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150904170046.4312.38018.stgit@devil> References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> Date: Fri, 4 Sep 2015 11:47:17 -0700 Message-ID: Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() From: Tom Herbert Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer wrote: > Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(), > in the network stack in form of function kfree_skb_bulk() which bulk > free SKBs (not skb clones or skb->head, yet). > > As this is the third user of SKB reference decrementing, split out > refcnt decrement into helper function and use this in all call points. 
> > Signed-off-by: Jesper Dangaard Brouer > --- > include/linux/skbuff.h | 1 + > net/core/skbuff.c | 87 +++++++++++++++++++++++++++++++++++++++--------- > 2 files changed, 71 insertions(+), 17 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index b97597970ce7..e5f1e007723b 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb) > } > > void kfree_skb(struct sk_buff *skb); > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size); > void kfree_skb_list(struct sk_buff *segs); > void skb_tx_error(struct sk_buff *skb); > void consume_skb(struct sk_buff *skb); > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 429b407b4fe6..034545934158 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb) > } > EXPORT_SYMBOL(__kfree_skb); > > +/* > + * skb_dec_and_test - Helper to drop ref to SKB and see is ready to free > + * @skb: buffer to decrement reference > + * > + * Drop a reference to the buffer, and return true if it is ready > + * to free. Which is if the usage count has hit zero or is equal to 1. > + * > + * This is performance critical code that should be inlined. > + */ > +static inline bool skb_dec_and_test(struct sk_buff *skb) > +{ > + if (unlikely(!skb)) > + return false; > + if (likely(atomic_read(&skb->users) == 1)) > + smp_rmb(); > + else if (likely(!atomic_dec_and_test(&skb->users))) > + return false; > + /* If reaching here SKB is ready to free */ > + return true; > +} > + > /** > * kfree_skb - free an sk_buff > * @skb: buffer to free > * > * Drop a reference to the buffer and free it if the usage count has > - * hit zero. > + * hit zero or is equal to 1. 
> */ > void kfree_skb(struct sk_buff *skb) > { > - if (unlikely(!skb)) > - return; > - if (likely(atomic_read(&skb->users) == 1)) > - smp_rmb(); > - else if (likely(!atomic_dec_and_test(&skb->users))) > - return; > - trace_kfree_skb(skb, __builtin_return_address(0)); > - __kfree_skb(skb); > + if (skb_dec_and_test(skb)) { > + trace_kfree_skb(skb, __builtin_return_address(0)); > + __kfree_skb(skb); > + } > } > EXPORT_SYMBOL(kfree_skb); > > +/** > + * kfree_skb_bulk - bulk free SKBs when refcnt allows to > + * @skbs: array of SKBs to free > + * @size: number of SKBs in array > + * > + * If SKB refcnt allows for free, then release any auxiliary data > + * and then bulk free SKBs to the SLAB allocator. > + * > + * Note that interrupts must be enabled when calling this function. > + */ > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size) > +{ What not pass a list of skbs (e.g. using skb->next)? > + int i; > + size_t cnt = 0; > + > + for (i = 0; i < size; i++) { > + struct sk_buff *skb = skbs[i]; > + > + if (!skb_dec_and_test(skb)) > + continue; /* skip skb, not ready to free */ > + > + /* Construct an array of SKBs, ready to be free'ed and > + * cleanup all auxiliary, before bulk free to SLAB. 
> + * For now, only handle non-cloned SKBs, related to > + * SLAB skbuff_head_cache > + */ > + if (skb->fclone == SKB_FCLONE_UNAVAILABLE) { > + skb_release_all(skb); > + skbs[cnt++] = skb; > + } else { > + /* SKB was a clone, don't handle this case */ > + __kfree_skb(skb); > + } > + } > + if (likely(cnt)) { > + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); > + } > +} > +EXPORT_SYMBOL(kfree_skb_bulk); > + > void kfree_skb_list(struct sk_buff *segs) > { > while (segs) { > @@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error); > */ > void consume_skb(struct sk_buff *skb) > { > - if (unlikely(!skb)) > - return; > - if (likely(atomic_read(&skb->users) == 1)) > - smp_rmb(); > - else if (likely(!atomic_dec_and_test(&skb->users))) > - return; > - trace_consume_skb(skb); > - __kfree_skb(skb); > + if (skb_dec_and_test(skb)) { > + trace_consume_skb(skb); > + __kfree_skb(skb); > + } > } > EXPORT_SYMBOL(consume_skb); > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f47.google.com (mail-qg0-f47.google.com [209.85.192.47]) by kanga.kvack.org (Postfix) with ESMTP id 45C7B6B0038 for ; Fri, 4 Sep 2015 14:55:26 -0400 (EDT) Received: by qgez77 with SMTP id z77so23482535qge.1 for ; Fri, 04 Sep 2015 11:55:26 -0700 (PDT) Received: from resqmta-ch2-05v.sys.comcast.net (resqmta-ch2-05v.sys.comcast.net. 
[2001:558:fe21:29:69:252:207:37]) by mx.google.com with ESMTPS id g192si360152qhc.93.2015.09.04.11.55.25 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 04 Sep 2015 11:55:25 -0700 (PDT) Date: Fri, 4 Sep 2015 13:55:24 -0500 (CDT) From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. In-Reply-To: <55E9DE51.7090109@gmail.com> Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: Jesper Dangaard Brouer , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. McKenney" , iamjoonsoo.kim@lge.com On Fri, 4 Sep 2015, Alexander Duyck wrote: > were to create a per-cpu pool for skbs that could be freed and allocated in > NAPI context. So for example we already have napi_alloc_skb, why not just add > a napi_free_skb and then make the array of objects to be freed part of a pool > that could be used for either allocation or freeing? If the pool runs empty > you just allocate something like 8 or 16 new skb heads, and if you fill it you > just free half of the list? The slab allocators provide something like a per cpu pool for you to optimize object alloc and free. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id 95F626B0038 for ; Fri, 4 Sep 2015 16:39:17 -0400 (EDT) Received: by pacex6 with SMTP id ex6so34161819pac.0 for ; Fri, 04 Sep 2015 13:39:17 -0700 (PDT) Received: from mail-pa0-x235.google.com (mail-pa0-x235.google.com. 
[2607:f8b0:400e:c03::235]) by mx.google.com with ESMTPS id hw3si984478pbb.159.2015.09.04.13.39.16 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 04 Sep 2015 13:39:17 -0700 (PDT) Received: by pacwi10 with SMTP id wi10so34112097pac.3 for ; Fri, 04 Sep 2015 13:39:16 -0700 (PDT) Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> From: Alexander Duyck Message-ID: <55EA0172.2040505@gmail.com> Date: Fri, 4 Sep 2015 13:39:14 -0700 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Jesper Dangaard Brouer , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. McKenney" , iamjoonsoo.kim@lge.com On 09/04/2015 11:55 AM, Christoph Lameter wrote: > On Fri, 4 Sep 2015, Alexander Duyck wrote: > >> were to create a per-cpu pool for skbs that could be freed and allocated in >> NAPI context. So for example we already have napi_alloc_skb, why not just add >> a napi_free_skb and then make the array of objects to be freed part of a pool >> that could be used for either allocation or freeing? If the pool runs empty >> you just allocate something like 8 or 16 new skb heads, and if you fill it you >> just free half of the list? > The slab allocators provide something like a per cpu pool for you to > optimize object alloc and free. Right, but one of the reasons for Jesper to implement the bulk alloc/free is to avoid the cmpxchg that is being used to get stuff into or off of the per cpu lists. In the case of network drivers they are running in softirq context almost exclusively. 
As such it is useful to have a set of buffers that can be acquired or freed from this context without the need to use any synchronization primitives. Then once the softirq context ends then we can free up some or all of the resources back to the slab allocator. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f172.google.com (mail-ig0-f172.google.com [209.85.213.172]) by kanga.kvack.org (Postfix) with ESMTP id BC0CD6B0038 for ; Fri, 4 Sep 2015 19:45:15 -0400 (EDT) Received: by igbkq10 with SMTP id kq10so23031085igb.0 for ; Fri, 04 Sep 2015 16:45:15 -0700 (PDT) Received: from resqmta-ch2-02v.sys.comcast.net (resqmta-ch2-02v.sys.comcast.net. [2001:558:fe21:29:69:252:207:34]) by mx.google.com with ESMTPS id mf6si4057423igb.0.2015.09.04.16.45.14 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 04 Sep 2015 16:45:15 -0700 (PDT) Date: Fri, 4 Sep 2015 18:45:13 -0500 (CDT) From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. In-Reply-To: <55EA0172.2040505@gmail.com> Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: Jesper Dangaard Brouer , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. McKenney" , iamjoonsoo.kim@lge.com On Fri, 4 Sep 2015, Alexander Duyck wrote: > Right, but one of the reasons for Jesper to implement the bulk alloc/free is > to avoid the cmpxchg that is being used to get stuff into or off of the per > cpu lists. There is no full cmpxchg used for the per cpu lists. 
It's a cmpxchg without lock semantics which is very cheap. > In the case of network drivers they are running in softirq context almost > exclusively. As such it is useful to have a set of buffers that can be > acquired or freed from this context without the need to use any > synchronization primitives. Then once the softirq context ends then we can > free up some or all of the resources back to the slab allocator. That is the case in the slab allocators. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f170.google.com (mail-io0-f170.google.com [209.85.223.170]) by kanga.kvack.org (Postfix) with ESMTP id EE1236B0038 for ; Sat, 5 Sep 2015 07:18:33 -0400 (EDT) Received: by ioiz6 with SMTP id z6so48689421ioi.2 for ; Sat, 05 Sep 2015 04:18:33 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q2si9494027pdi.59.2015.09.05.04.18.32 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 05 Sep 2015 04:18:33 -0700 (PDT) Date: Sat, 5 Sep 2015 13:18:25 +0200 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. Message-ID: <20150905131825.6c04837d@redhat.com> In-Reply-To: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E.
McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com On Fri, 4 Sep 2015 18:45:13 -0500 (CDT) Christoph Lameter wrote: > On Fri, 4 Sep 2015, Alexander Duyck wrote: > > Right, but one of the reasons for Jesper to implement the bulk alloc/free is > > to avoid the cmpxchg that is being used to get stuff into or off of the per > > cpu lists. > > There is no full cmpxchg used for the per cpu lists. Its a cmpxchg without > lock semantics which is very cheap. The double_cmpxchg without a lock prefix still costs 9 cycles, which is very fast but still a cost (add approx. 19 cycles for a lock prefix). It is slower than local_irq_disable + local_irq_enable, which the bulking call uses and which only costs 7 cycles. (That is the reason bulk calls with 1 object can almost compete with the fastpath.) > > In the case of network drivers they are running in softirq context almost > > exclusively. As such it is useful to have a set of buffers that can be > > acquired or freed from this context without the need to use any > > synchronization primitives. Then once the softirq context ends then we can > > free up some or all of the resources back to the slab allocator. > > That is the case in the slab allocators. There is potential for taking advantage of this softirq context, which is basically what my qmempool implementation did. But we have now optimized the slub allocator to an extent that it (in the case of slab-tuning or slab_nomerge) is faster than my qmempool implementation. Thus, I would like a smaller/slimmer layer than qmempool. We do need some per-CPU cache for allocations, like Alex suggests, but I'm not sure we need that for the free side. For now I'm returning objects/skbs directly to slub, and am hoping enough objects can be merged in a detached freelist, which allows me to return several objects with a single locked double_cmpxchg. -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr.
Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id 3F02D6B0038 for ; Mon, 7 Sep 2015 04:16:18 -0400 (EDT) Received: by pacex6 with SMTP id ex6so90266361pac.0 for ; Mon, 07 Sep 2015 01:16:18 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id ev1si5822531pbb.19.2015.09.07.01.16.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Sep 2015 01:16:17 -0700 (PDT) Date: Mon, 7 Sep 2015 10:16:10 +0200 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. Message-ID: <20150907101610.44504597@redhat.com> In-Reply-To: <55E9DE51.7090109@gmail.com> References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com On Fri, 4 Sep 2015 11:09:21 -0700 Alexander Duyck wrote: > This is an interesting start. However I feel like it might work better > if you were to create a per-cpu pool for skbs that could be freed and > allocated in NAPI context. So for example we already have > napi_alloc_skb, why not just add a napi_free_skb I do like the idea... 
> and then make the array > of objects to be freed part of a pool that could be used for either > allocation or freeing? If the pool runs empty you just allocate > something like 8 or 16 new skb heads, and if you fill it you just free > half of the list? But I worry that this algorithm will "randomize" the (skb) objects. And the SLUB bulk optimization only works if we have many objects belonging to the same page. It would likely be fastest to implement a simple stack (for these per-cpu pools), but I again worry that it would randomize the object-pages. A simple queue might be better, but slightly slower. Guess I could just reuse part of qmempool / alf_queue as a quick test. Having a per-cpu pool in networking would solve the problem of the slub per-cpu pool isn't large enough for our use-case. On the other hand, maybe we should fix slub to dynamically adjust the size of it's per-cpu resources? A pre-req knowledge (for people not knowing slub's internal details): Slub alloc path will pickup a page, and empty all objects for that page before proceeding to the next page. Thus, slub bulk alloc will give many objects belonging to the page. I'm trying to keep these objects grouped together until they can be free'ed in a bulk. -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id 8B9D96B0038 for ; Mon, 7 Sep 2015 04:41:10 -0400 (EDT) Received: by padhy16 with SMTP id hy16so88711708pad.1 for ; Mon, 07 Sep 2015 01:41:10 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id rb8si18934632pab.112.2015.09.07.01.41.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Sep 2015 01:41:09 -0700 (PDT) Date: Mon, 7 Sep 2015 10:41:01 +0200 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Message-ID: <20150907104101.3e392a6d@redhat.com> In-Reply-To: References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tom Herbert Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com On Fri, 4 Sep 2015 11:47:17 -0700 Tom Herbert wrote: > On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer wrote: > > Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(), > > in the network stack in form of function kfree_skb_bulk() which bulk > > free SKBs (not skb clones or skb->head, yet). > > [...] > > +/** > > + * kfree_skb_bulk - bulk free SKBs when refcnt allows to > > + * @skbs: array of SKBs to free > > + * @size: number of SKBs in array > > + * > > + * If SKB refcnt allows for free, then release any auxiliary data > > + * and then bulk free SKBs to the SLAB allocator. > > + * > > + * Note that interrupts must be enabled when calling this function. > > + */ > > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size) > > +{ > > What not pass a list of skbs (e.g. using skb->next)? 
Because the next layer, the slab API needs an array: kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) Look at the patch: [PATCH V2 3/3] slub: build detached freelist with look-ahead http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472 Where I use this array to progressively scan for objects belonging to the same page. (A subtle detail is I manage to zero out the array, which is good from a security/error-handling point of view, as pointers to the objects are not left dangling on the stack). I cannot argue that, writing skb->next comes as an additional cost, because the slUb free also writes into this cacheline. Perhaps the slAb allocator does not? [...] > > + if (likely(cnt)) { > > + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); > > + } > > +} > > +EXPORT_SYMBOL(kfree_skb_bulk); -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f177.google.com (mail-ig0-f177.google.com [209.85.213.177]) by kanga.kvack.org (Postfix) with ESMTP id 1CC2C6B0038 for ; Mon, 7 Sep 2015 12:25:51 -0400 (EDT) Received: by igcrk20 with SMTP id rk20so56792747igc.1 for ; Mon, 07 Sep 2015 09:25:51 -0700 (PDT) Received: from mail-io0-f195.google.com (mail-io0-f195.google.com. 
[209.85.223.195]) by mx.google.com with ESMTPS id lq8si640124igb.70.2015.09.07.09.25.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Sep 2015 09:25:50 -0700 (PDT) Received: by ioiz6 with SMTP id z6so9955145ioi.3 for ; Mon, 07 Sep 2015 09:25:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150907104101.3e392a6d@redhat.com> References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> <20150907104101.3e392a6d@redhat.com> Date: Mon, 7 Sep 2015 09:25:49 -0700 Message-ID: Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() From: Tom Herbert Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com >> What not pass a list of skbs (e.g. using skb->next)? > > Because the next layer, the slab API needs an array: > kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > I suppose we could ask the same question of that function. IMO encouraging drivers to define arrays of pointers on the stack like you're doing in the ixgbe patch is a bad direction. In any case I believe this would be simpler in the networking side just to maintain a list of skb's to free. Then the dev_free_waitlist structure might not be needed then since we could just use a skb_buf_head for that. Tom > Look at the patch: > [PATCH V2 3/3] slub: build detached freelist with look-ahead > http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472 > > Where I use this array to progressively scan for objects belonging to > the same page. (A subtle detail is I manage to zero out the array, > which is good from a security/error-handling point of view, as pointers > to the objects are not left dangling on the stack). 
> > > I cannot argue that, writing skb->next comes as an additional cost, > because the slUb free also writes into this cacheline. Perhaps the > slAb allocator does not? > > [...] >> > + if (likely(cnt)) { >> > + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); >> > + } >> > +} >> > +EXPORT_SYMBOL(kfree_skb_bulk); > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f182.google.com (mail-qk0-f182.google.com [209.85.220.182]) by kanga.kvack.org (Postfix) with ESMTP id F244A6B0038 for ; Mon, 7 Sep 2015 16:14:57 -0400 (EDT) Received: by qkcj187 with SMTP id j187so35842814qkc.2 for ; Mon, 07 Sep 2015 13:14:57 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id k89si1027450qge.7.2015.09.07.13.14.56 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Sep 2015 13:14:57 -0700 (PDT) Date: Mon, 7 Sep 2015 22:14:48 +0200 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Message-ID: <20150907221448.2b18b174@redhat.com> In-Reply-To: References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> <20150907104101.3e392a6d@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tom Herbert Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. 
McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com On Mon, 7 Sep 2015 09:25:49 -0700 Tom Herbert wrote: > >> What not pass a list of skbs (e.g. using skb->next)? > > > > Because the next layer, the slab API needs an array: > > kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > > > > I suppose we could ask the same question of that function. IMO > encouraging drivers to define arrays of pointers on the stack like > you're doing in the ixgbe patch is a bad direction. > > In any case I believe this would be simpler in the networking side > just to maintain a list of skb's to free. Then the dev_free_waitlist > structure might not be needed then since we could just use a > skb_buf_head for that. I guess it is more natural for the network side to work with skb lists. But I'm keeping the array for slab/slub as we cannot assume/enforce objects of a specific data type. I was worried about how large a bulk free we should allow, due to the interaction with skb->destructor, which for sockets affects their memory accounting. E.g. we have seen issues with hypervisor network drivers (Xen and HyperV) that are so slow to clean up their TX completion queue that their TCP bandwidth gets limited by tcp_limit_output_bytes. I capped it at 32, and the NAPI budget will cap it at 64. By the following argument, bulk free of 64 objects/skb's is not a problem. The delay I'm introducing is very small, before the first real kfree_skb is called, which calls the destructor which frees up socket memory accounting. Assume measured packet rate of: 2105011 pps Time between packets (1/2105011*10^9): 475 ns Perf shows ixgbe_clean_tx_irq() takes: 1.23% Extrapolating the function call cost: 5.84 ns (475*(1.23/100)) Processing 64 packets in ixgbe_clean_tx_irq() takes: 373 ns (64*5.84). At 10Gbit/s how many bytes can arrive in this period, only: 466 bytes. ((373/10^9)*(10000*10^6)/8) -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr.
Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id 4D3436B0038 for ; Mon, 7 Sep 2015 17:23:42 -0400 (EDT) Received: by padhk3 with SMTP id hk3so19882874pad.3 for ; Mon, 07 Sep 2015 14:23:42 -0700 (PDT) Received: from mail-pa0-x229.google.com (mail-pa0-x229.google.com. [2607:f8b0:400e:c03::229]) by mx.google.com with ESMTPS id xf4si1726020pbc.138.2015.09.07.14.23.41 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Sep 2015 14:23:41 -0700 (PDT) Received: by padhy16 with SMTP id hy16so102599904pad.1 for ; Mon, 07 Sep 2015 14:23:41 -0700 (PDT) Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <20150907101610.44504597@redhat.com> From: Alexander Duyck Message-ID: <55EE005B.9080802@gmail.com> Date: Mon, 7 Sep 2015 14:23:39 -0700 MIME-Version: 1.0 In-Reply-To: <20150907101610.44504597@redhat.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com On 09/07/2015 01:16 AM, Jesper Dangaard Brouer wrote: > On Fri, 4 Sep 2015 11:09:21 -0700 > Alexander Duyck wrote: > >> This is an interesting start. 
However I feel like it might work better >> if you were to create a per-cpu pool for skbs that could be freed and >> allocated in NAPI context. So for example we already have >> napi_alloc_skb, why not just add a napi_free_skb > I do like the idea... If nothing else you want to avoid having to redo this code for every driver. If you can just replace dev_kfree_skb with some other freeing call it will make it much easier to convert other drivers. >> and then make the array >> of objects to be freed part of a pool that could be used for either >> allocation or freeing? If the pool runs empty you just allocate >> something like 8 or 16 new skb heads, and if you fill it you just free >> half of the list? > But I worry that this algorithm will "randomize" the (skb) objects. > And the SLUB bulk optimization only works if we have many objects > belonging to the same page. Agreed to some extent, however at the same time what this does is allow for a certain amount of skb recycling. So instead of freeing the buffers received from the socket you would likely be recycling them and sending them back as Rx skbs. In the case of a heavy routing workload you would likely just be cycling through the same set of buffers and cleaning them off of transmit and placing them back on receive. The general idea is to keep the memory footprint small so recycling Tx buffers to use for Rx can have its advantages in terms of keeping things confined to limits of the L1/L2 cache. > It would likely be fastest to implement a simple stack (for these > per-cpu pools), but I again worry that it would randomize the > object-pages. A simple queue might be better, but slightly slower. > Guess I could just reuse part of qmempool / alf_queue as a quick test. I would say don't over engineer it. A stack is the simplest. The qmempool / alf_queue is just going to add extra overhead. 
The added advantage to the stack is that you are working with pointers and you are guaranteed that the list of pointers is going to be linear. If you use a queue, clean-up will require up to 2 blocks of freeing in case the ring has wrapped. > Having a per-cpu pool in networking would solve the problem of the slub > per-cpu pool isn't large enough for our use-case. On the other hand, > maybe we should fix slub to dynamically adjust the size of it's per-cpu > resources? The per-cpu pool is just meant to replace the per-driver pool you were using. By using a per-cpu pool you would get better aggregation and can just flush the freed buffers at the end of the Rx softirq or when the pool is full instead of having to flush smaller lists per call to napi->poll. > A pre-req knowledge (for people not knowing slub's internal details): > Slub alloc path will pickup a page, and empty all objects for that page > before proceeding to the next page. Thus, slub bulk alloc will give > many objects belonging to the page. I'm trying to keep these objects > grouped together until they can be free'ed in a bulk. The problem is you aren't going to be able to keep them together very easily. Yes, they might be allocated all from one spot on Rx but they can very easily end up scattered to multiple locations. The same applies to Tx where you will have multiple flows all outgoing on one port. That is why I was thinking that adding some skb recycling via a per-cpu stack might be useful, especially since you have to either fill or empty the stack when you allocate or free multiple skbs anyway. In addition it provides an easy way for a bulk alloc and a bulk free to share data structures without adding additional overhead by keeping them separate. If you managed it with some sort of high-water/low-water mark type setup you could very well keep the bulk-alloc/free busy without too much fragmentation.
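The per-cpu stack with a refill-when-empty, free-half-when-full policy discussed above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not code from any patch in this thread: struct obj_pool, pool_get(), pool_put(), pool_drain() and the backend_*_bulk() helpers are hypothetical names, malloc()/free() stand in for a slab bulk API, and the sizes 32 and 8 are arbitrary. There is no per-cpu or softirq handling here; one pool instance stands in for one CPU's stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE   32  /* assumption: capacity of the stack (high-water mark) */
#define POOL_REFILL 8   /* assumption: objects fetched when the stack is empty */

struct obj_pool {
	void *slot[POOL_SIZE]; /* linear array of cached object pointers */
	size_t top;            /* number of objects currently cached */
	size_t bulk_allocs;    /* counters, kept only for illustration */
	size_t bulk_frees;
};

/* Hypothetical stand-in for a bulk alloc API; returns 0 on success. */
int backend_alloc_bulk(size_t n, void **p)
{
	for (size_t i = 0; i < n; i++) {
		p[i] = malloc(64);
		if (!p[i]) {
			while (i)
				free(p[--i]);
			return -1;
		}
	}
	return 0;
}

/* Hypothetical stand-in for a bulk free API. */
void backend_free_bulk(size_t n, void **p)
{
	for (size_t i = 0; i < n; i++)
		free(p[i]);
}

/* Pop one object; refill with one bulk alloc when the stack is empty. */
void *pool_get(struct obj_pool *pool)
{
	if (pool->top == 0) {
		if (backend_alloc_bulk(POOL_REFILL, pool->slot))
			return NULL;
		pool->top = POOL_REFILL;
		pool->bulk_allocs++;
	}
	return pool->slot[--pool->top];
}

/* Push one object; when full, first free the upper half in one bulk call.
 * Because the stack is a linear array, that half is one contiguous block.
 */
void pool_put(struct obj_pool *pool, void *obj)
{
	if (pool->top == POOL_SIZE) {
		size_t half = POOL_SIZE / 2;

		backend_free_bulk(half, &pool->slot[POOL_SIZE - half]);
		pool->top -= half;
		pool->bulk_frees++;
	}
	pool->slot[pool->top++] = obj;
}

/* Return everything still cached to the backend. */
void pool_drain(struct obj_pool *pool)
{
	backend_free_bulk(pool->top, pool->slot);
	pool->top = 0;
}
```

With these sizes, 40 consecutive pool_get() calls on an empty pool trigger five bulk allocations, and handing the same 40 objects back through pool_put() triggers a single bulk free of 16 objects. Freeing the upper half is a choice made for this sketch because it keeps the freed block contiguous without moving pointers; it also gives up the most recently used objects, so a real design might prefer a different half.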
For the socket transmit/receive case the thing you have to keep in mind is that if you reuse the buffers you are just going to be throwing them back at the sockets which are likely not using bulk-free anyway. So in that case reuse could actually improve things by simply reducing the number of calls to bulk-alloc you will need to make since things like TSO allow you to send 64K using a single sk_buff, while you will be likely be receiving one or more acks on the receive side which will require allocations. - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f171.google.com (mail-ig0-f171.google.com [209.85.213.171]) by kanga.kvack.org (Postfix) with ESMTP id D48D16B0038 for ; Tue, 8 Sep 2015 13:32:41 -0400 (EDT) Received: by igxx6 with SMTP id x6so22048629igx.1 for ; Tue, 08 Sep 2015 10:32:41 -0700 (PDT) Received: from resqmta-ch2-06v.sys.comcast.net (resqmta-ch2-06v.sys.comcast.net. [2001:558:fe21:29:69:252:207:38]) by mx.google.com with ESMTPS id e36si3848238ioj.101.2015.09.08.10.32.41 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Tue, 08 Sep 2015 10:32:41 -0700 (PDT) Date: Tue, 8 Sep 2015 12:32:40 -0500 (CDT) From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. In-Reply-To: <20150905131825.6c04837d@redhat.com> Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> <20150905131825.6c04837d@redhat.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. 
McKenney" , iamjoonsoo.kim@lge.com On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote: > The double_cmpxchg without lock prefix still cost 9 cycles, which is > very fast but still a cost (add approx 19 cycles for a lock prefix). > > It is slower than local_irq_disable + local_irq_enable that only cost > 7 cycles, which the bulking call uses. (That is the reason bulk calls > with 1 object can almost compete with fastpath). Hmmm... Guess we need to come up with distinct version of kmalloc() for irq and non irq contexts to take advantage of that . Most at non irq context anyways. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id F07C66B0038 for ; Tue, 8 Sep 2015 17:01:13 -0400 (EDT) Received: by padhk3 with SMTP id hk3so49122843pad.3 for ; Tue, 08 Sep 2015 14:01:13 -0700 (PDT) Received: from shards.monkeyblade.net (shards.monkeyblade.net. 
[2001:4f8:3:36:211:85ff:fe63:a549]) by mx.google.com with ESMTP id tf1si2592036pac.5.2015.09.08.14.01.12 for ; Tue, 08 Sep 2015 14:01:13 -0700 (PDT) Date: Tue, 08 Sep 2015 14:01:10 -0700 (PDT) Message-Id: <20150908.140110.899240065088272758.davem@davemloft.net> Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() From: David Miller In-Reply-To: <20150904170046.4312.38018.stgit@devil> References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: brouer@redhat.com Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, cl@linux.com, paulmck@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com From: Jesper Dangaard Brouer Date: Fri, 04 Sep 2015 19:00:53 +0200 > +/** > + * kfree_skb_bulk - bulk free SKBs when refcnt allows to > + * @skbs: array of SKBs to free > + * @size: number of SKBs in array > + * > + * If SKB refcnt allows for free, then release any auxiliary data > + * and then bulk free SKBs to the SLAB allocator. > + * > + * Note that interrupts must be enabled when calling this function. > + */ > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size) > +{ > + int i; > + size_t cnt = 0; > + > + for (i = 0; i < size; i++) { > + struct sk_buff *skb = skbs[i]; > + > + if (!skb_dec_and_test(skb)) > + continue; /* skip skb, not ready to free */ > + > + /* Construct an array of SKBs, ready to be free'ed and > + * cleanup all auxiliary, before bulk free to SLAB. 
> + * For now, only handle non-cloned SKBs, related to > + * SLAB skbuff_head_cache > + */ > + if (skb->fclone == SKB_FCLONE_UNAVAILABLE) { > + skb_release_all(skb); > + skbs[cnt++] = skb; > + } else { > + /* SKB was a clone, don't handle this case */ > + __kfree_skb(skb); > + } > + } > + if (likely(cnt)) { > + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); > + } > +} You're going to have to do a trace_kfree_skb() or trace_consume_skb() for these things. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f169.google.com (mail-qk0-f169.google.com [209.85.220.169]) by kanga.kvack.org (Postfix) with ESMTP id 573F56B0038 for ; Wed, 9 Sep 2015 08:59:29 -0400 (EDT) Received: by qkfq186 with SMTP id q186so3162026qkf.1 for ; Wed, 09 Sep 2015 05:59:29 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id o3si8115138qki.31.2015.09.09.05.59.28 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Sep 2015 05:59:28 -0700 (PDT) Date: Wed, 9 Sep 2015 14:59:19 +0200 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. Message-ID: <20150909145919.4d68ea36@redhat.com> In-Reply-To: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> <20150905131825.6c04837d@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. 
McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com On Tue, 8 Sep 2015 12:32:40 -0500 (CDT) Christoph Lameter wrote: > On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote: > > > The double_cmpxchg without lock prefix still cost 9 cycles, which is > > very fast but still a cost (add approx 19 cycles for a lock prefix). > > > > It is slower than local_irq_disable + local_irq_enable that only cost > > 7 cycles, which the bulking call uses. (That is the reason bulk calls > > with 1 object can almost compete with fastpath). > > Hmmm... Guess we need to come up with distinct version of kmalloc() for > irq and non irq contexts to take advantage of that . Most at non irq > context anyways. I agree, it would be an easy win. Do notice this will have the most impact for the slAb allocator. I estimate alloc + free cost would save: * slAb would save approx 60 cycles * slUb would save approx 4 cycles We might consider keeping the slUb approach as it would be more friendly for RT with less IRQ disabling. -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f174.google.com (mail-qk0-f174.google.com [209.85.220.174]) by kanga.kvack.org (Postfix) with ESMTP id 641F66B0254 for ; Wed, 9 Sep 2015 10:08:50 -0400 (EDT) Received: by qkdw123 with SMTP id w123so4322466qkd.0 for ; Wed, 09 Sep 2015 07:08:50 -0700 (PDT) Received: from resqmta-ch2-09v.sys.comcast.net (resqmta-ch2-09v.sys.comcast.net. 
[2001:558:fe21:29:69:252:207:41]) by mx.google.com with ESMTPS id n23si8310696qkl.111.2015.09.09.07.08.49 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 09 Sep 2015 07:08:49 -0700 (PDT) Date: Wed, 9 Sep 2015 09:08:47 -0500 (CDT) From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. In-Reply-To: <20150909145919.4d68ea36@redhat.com> Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> <20150905131825.6c04837d@redhat.com> <20150909145919.4d68ea36@redhat.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. McKenney" , iamjoonsoo.kim@lge.com On Wed, 9 Sep 2015, Jesper Dangaard Brouer wrote: > > Hmmm... Guess we need to come up with distinct version of kmalloc() for > > irq and non irq contexts to take advantage of that . Most at non irq > > context anyways. > > I agree, it would be an easy win. Do notice this will have the most > impact for the slAb allocator. > > I estimate alloc + free cost would save: > * slAb would save approx 60 cycles > * slUb would save approx 4 cycles > > We might consider keeping the slUb approach as it would be more > friendly for RT with less IRQ disabling. IRQ disabling is a mixed bag. Older CPUs have higher latencies there, and virtualized contexts may also require that the hypervisor track the interrupt state. For recent Intel CPUs this is certainly a workable approach. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f43.google.com (mail-qg0-f43.google.com [209.85.192.43]) by kanga.kvack.org (Postfix) with ESMTP id 3FCA56B0038 for ; Wed, 16 Sep 2015 06:02:37 -0400 (EDT) Received: by qgt47 with SMTP id 47so166283858qgt.2 for ; Wed, 16 Sep 2015 03:02:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id h185si21167561qhc.83.2015.09.16.03.02.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 16 Sep 2015 03:02:36 -0700 (PDT) Date: Wed, 16 Sep 2015 12:02:30 +0200 From: Jesper Dangaard Brouer Subject: Experiences with slub bulk use-case for network stack Message-ID: <20150916120230.4ca75217@redhat.com> In-Reply-To: <20150904165944.4312.32435.stgit@devil> References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Christoph Lameter Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, Alexander Duyck , iamjoonsoo.kim@lge.com Hint, this leads up to discussing if current bulk *ALLOC* API need to be changed... Alex and I have been working hard on practical use-case for SLAB bulking (mostly slUb), in the network stack. Here is a summary of what we have learned so far. Bulk free'ing SKBs during TX completion is a big and easy win. Specifically for slUb, normal path for freeing these objects (which are not on c->freelist) require a locked double_cmpxchg per object. The bulk free (via detached freelist patch) allow to free all objects belonging to the same slab-page, to be free'ed with a single locked double_cmpxchg. Thus, the bulk free speedup is quite an improvement. 
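The saving described above (one locked operation per slab-page instead of one per object) can be illustrated with a small user-space sketch. It is not the kernel patch and makes several assumptions: all names below are hypothetical, a page is approximated by masking the object address with an assumed 4 KiB page size, the whole array is scanned rather than using a bounded look-ahead, and each group that is built simply stands in for one locked double_cmpxchg. Objects must be at least pointer-sized, since the list is linked through their first word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: objects are grouped by 4 KiB page, found by masking the
 * address. This is a simplification made for the sketch.
 */
#define SKETCH_PAGE_SIZE 4096u

struct sketch_detached_freelist {
	uintptr_t page;  /* page that every object on this list belongs to */
	void *freelist;  /* head of a singly linked list through the objects */
	size_t cnt;      /* number of objects on the list */
};

uintptr_t sketch_page_of(const void *obj)
{
	return (uintptr_t)obj & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
}

/* Build one detached freelist from p[0..size-1], scanning from the end.
 * The first object found picks the page; every other object on that page
 * is linked in through its first word and its array slot is zeroed.
 * Returns the number of leading slots that may still hold objects
 * (0 when nothing is left).
 */
size_t sketch_build_detached_freelist(void **p, size_t size,
				      struct sketch_detached_freelist *df)
{
	size_t remaining = 0;

	df->page = 0;
	df->freelist = NULL;
	df->cnt = 0;

	while (size) {
		void *obj = p[--size];

		if (!obj)
			continue; /* slot already processed */

		if (df->cnt == 0)
			df->page = sketch_page_of(obj);

		if (sketch_page_of(obj) == df->page) {
			/* store the current head in the object's first word */
			memcpy(obj, &df->freelist, sizeof(void *));
			df->freelist = obj;
			df->cnt++;
			p[size] = NULL; /* no pointer left behind in the array */
		} else if (!remaining) {
			remaining = size + 1; /* highest slot still in use */
		}
	}
	return remaining;
}

/* Process the whole array; returns how many per-page groups were built,
 * i.e. how many locked operations this sketch would need.
 */
size_t sketch_bulk_free(void **p, size_t size)
{
	struct sketch_detached_freelist df;
	size_t groups = 0;

	while (size) {
		size = sketch_build_detached_freelist(p, size, &df);
		if (df.cnt)
			groups++; /* one "locked double_cmpxchg" per group */
	}
	return groups;
}
```

For an array of five objects spread over two pages, sketch_bulk_free() builds two groups, where freeing them one by one through the slowpath would need five locked operations. The benefit shrinks as objects from different pages get interleaved, which is why keeping objects from the same page together matters for this approach.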
The slUb alloc is hard to beat on speed: * accessing c->freelist, local cmpxchg 9 cycles (38% of cost) * c->freelist is refilled with single locked cmpxchg In micro benchmarking it looks like we can beat alloc, because we do a local_irq_{disable,enable} (cost 7 cycles) and then pull out all objects in c->freelist, thus saving 9 cycles per object (counting from the 2nd object). However, in practical use-cases we are seeing the single object alloc win over bulk alloc; we believe this to be due to prefetching. When c->freelist gets (semi) cache-cold, it gets more expensive to walk the freelist (which is a basic singly linked list to the next free object). For bulk alloc the full freelist is walked (right away) and objects are pulled out into the array. For normal single object alloc only a single object is returned, but it does a prefetch on the next object pointer. Thus, the next time single alloc is called the object will have been prefetched. Doing prefetch in bulk alloc only helps a little, as it does not have enough "time" between accessing/walking the freelist for objects. So, how can we solve this and make bulk alloc faster? Alex and I had the idea that bulk alloc returns an "allocator specific cache" data-structure (and we add some helpers to access this). In the slUb case, the freelist is a singly linked pointer list. In the network stack the skb objects have a skb->next pointer, which is located at the same position as the freelist pointer. Thus, the freelist could simply be returned directly and interpreted as a skb-list. The helper API would then do the prefetching when pulling out objects. For the slUb case, we would simply cmpxchg either c->freelist or page->freelist with a NULL ptr, and then own all objects on the freelist. This also reduces the time we keep IRQs disabled. API wise, we don't (necessarily) know how many objects are on the freelist (without first walking the list, which would cause stalls on data, which we are trying to avoid).
Thus, the API of always returning the exact number of requested objects will not work... -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer (related to http://thread.gmane.org/gmane.linux.kernel.mm/137469) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yk0-f180.google.com (mail-yk0-f180.google.com [209.85.160.180]) by kanga.kvack.org (Postfix) with ESMTP id 4DA966B0038 for ; Wed, 16 Sep 2015 11:13:28 -0400 (EDT) Received: by ykdg206 with SMTP id g206so221835327ykd.1 for ; Wed, 16 Sep 2015 08:13:28 -0700 (PDT) Received: from resqmta-ch2-03v.sys.comcast.net (resqmta-ch2-03v.sys.comcast.net. [2001:558:fe21:29:69:252:207:35]) by mx.google.com with ESMTPS id p32si22294847qge.61.2015.09.16.08.13.26 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 16 Sep 2015 08:13:27 -0700 (PDT) Date: Wed, 16 Sep 2015 10:13:25 -0500 (CDT) From: Christoph Lameter Subject: Re: Experiences with slub bulk use-case for network stack In-Reply-To: <20150916120230.4ca75217@redhat.com> Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <20150916120230.4ca75217@redhat.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, netdev@vger.kernel.org, akpm@linux-foundation.org, Alexander Duyck , iamjoonsoo.kim@lge.com On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote: > > Hint, this leads up to discussing if current bulk *ALLOC* API need to > be changed... > > Alex and I have been working hard on practical use-case for SLAB > bulking (mostly slUb), in the network stack. Here is a summary of > what we have learned so far. 
SLAB refers to the SLAB allocator which is one slab allocator and SLUB is another slab allocator. Please keep that consistent otherwise things get confusing > Bulk free'ing SKBs during TX completion is a big and easy win. > > Specifically for slUb, normal path for freeing these objects (which > are not on c->freelist) require a locked double_cmpxchg per object. > The bulk free (via detached freelist patch) allow to free all objects > belonging to the same slab-page, to be free'ed with a single locked > double_cmpxchg. Thus, the bulk free speedup is quite an improvement. Yep. > Alex and I had the idea of bulk alloc returns an "allocator specific > cache" data-structure (and we add some helpers to access this). Maybe add some Macros to handle this? > In the slUb case, the freelist is a single linked pointer list. In > the network stack the skb objects have a skb->next pointer, which is > located at the same position as freelist pointer. Thus, simply > returning the freelist directly, could be interpreted as a skb-list. > The helper API would then do the prefetching, when pulling out > objects. The problem with the SLUB case is that the objects must be on the same slab page. > For the slUb case, we would simply cmpxchg either c->freelist or > page->freelist with a NULL ptr, and then own all objects on the > freelist. This also reduce the time we keep IRQs disabled. You dont need to disable interrupts for the cmpxchges. There is additional state in the page struct though so the updates must be done carefully. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f43.google.com (mail-qg0-f43.google.com [209.85.192.43]) by kanga.kvack.org (Postfix) with ESMTP id 4B9D66B0038 for ; Thu, 17 Sep 2015 16:17:08 -0400 (EDT) Received: by qgev79 with SMTP id v79so22800005qge.0 for ; Thu, 17 Sep 2015 13:17:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id p21si4333976qki.114.2015.09.17.13.17.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 17 Sep 2015 13:17:07 -0700 (PDT) Date: Thu, 17 Sep 2015 22:17:02 +0200 From: Jesper Dangaard Brouer Subject: Re: Experiences with slub bulk use-case for network stack Message-ID: <20150917221702.734a42dc@redhat.com> In-Reply-To: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <20150916120230.4ca75217@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: linux-mm@kvack.org, netdev@vger.kernel.org, akpm@linux-foundation.org, Alexander Duyck , iamjoonsoo.kim@lge.com, brouer@redhat.com On Wed, 16 Sep 2015 10:13:25 -0500 (CDT) Christoph Lameter wrote: > On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote: > > > > > Hint, this leads up to discussing if current bulk *ALLOC* API need to > > be changed... > > > > Alex and I have been working hard on practical use-case for SLAB > > bulking (mostly slUb), in the network stack. Here is a summary of > > what we have learned so far. > > SLAB refers to the SLAB allocator which is one slab allocator and SLUB is > another slab allocator. > > Please keep that consistent otherwise things get confusing This naming scheme is really confusing. I'll try to be more consistent. So, you want capital letters SLAB and SLUB when talking about a specific slab allocator implementation. 
> > Bulk free'ing SKBs during TX completion is a big and easy win. > > > > Specifically for slUb, normal path for freeing these objects (which > > are not on c->freelist) require a locked double_cmpxchg per object. > > The bulk free (via detached freelist patch) allow to free all objects > > belonging to the same slab-page, to be free'ed with a single locked > > double_cmpxchg. Thus, the bulk free speedup is quite an improvement. > > Yep. > > > Alex and I had the idea of bulk alloc returns an "allocator specific > > cache" data-structure (and we add some helpers to access this). > > Maybe add some Macros to handle this? Yes, helpers will likely turn out to be macros. > > In the slUb case, the freelist is a single linked pointer list. In > > the network stack the skb objects have a skb->next pointer, which is > > located at the same position as freelist pointer. Thus, simply > > returning the freelist directly, could be interpreted as a skb-list. > > The helper API would then do the prefetching, when pulling out > > objects. > > The problem with the SLUB case is that the objects must be on the same > slab page. Yes, I'm aware that, that is what we are trying to take advantage of. > > For the slUb case, we would simply cmpxchg either c->freelist or > > page->freelist with a NULL ptr, and then own all objects on the > > freelist. This also reduce the time we keep IRQs disabled. > > You dont need to disable interrupts for the cmpxchges. There is > additional state in the page struct though so the updates must be > done carefully. Yes, I'm aware of cmpxchg does not need to disable interrupts. And I plan to take advantage of this, in this new approach for bulk alloc. Our current bulk alloc disables interrupts for the full period (of collecting the number requested objects). What I'm proposing is keeping interrupts on, and then simply cmpxchg e.g 2 slab-pages out of the SLUB allocator (which the SLUB code calls freelist's). 
The bulk call now owns these freelists, and returns them to the caller. The API caller gets some helpers/macros to access objects, to shield him from the details (of SLUB freelist's). The pitfall with this API is we don't know how many objects are on a SLUB freelist. And we cannot walk the freelist and count them, because then we hit the problem of memory/cache stalls (that we are trying so hard to avoid). -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f174.google.com (mail-ig0-f174.google.com [209.85.213.174]) by kanga.kvack.org (Postfix) with ESMTP id 8877A6B0038 for ; Thu, 17 Sep 2015 19:57:19 -0400 (EDT) Received: by igbkq10 with SMTP id kq10so7029702igb.0 for ; Thu, 17 Sep 2015 16:57:19 -0700 (PDT) Received: from resqmta-ch2-09v.sys.comcast.net (resqmta-ch2-09v.sys.comcast.net. 
[2001:558:fe21:29:69:252:207:41]) by mx.google.com with ESMTPS id x75si4491674ioi.11.2015.09.17.16.57.18 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Thu, 17 Sep 2015 16:57:18 -0700 (PDT)
Date: Thu, 17 Sep 2015 18:57:17 -0500 (CDT)
From: Christoph Lameter
Subject: Re: Experiences with slub bulk use-case for network stack
In-Reply-To: <20150917221702.734a42dc@redhat.com>
Message-ID:
References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <20150916120230.4ca75217@redhat.com> <20150917221702.734a42dc@redhat.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jesper Dangaard Brouer
Cc: linux-mm@kvack.org, netdev@vger.kernel.org, akpm@linux-foundation.org, Alexander Duyck , iamjoonsoo.kim@lge.com

On Thu, 17 Sep 2015, Jesper Dangaard Brouer wrote:

> What I'm proposing is keeping interrupts on, and then simply cmpxchg
> e.g 2 slab-pages out of the SLUB allocator (which the SLUB code calls
> freelist's). The bulk call now owns these freelists, and returns them
> to the caller. The API caller gets some helpers/macros to access
> objects, to shield him from the details (of SLUB freelist's).
>
> The pitfall with this API is we don't know how many objects are on a
> SLUB freelist. And we cannot walk the freelist and count them, because
> then we hit the problem of memory/cache stalls (that we are trying so
> hard to avoid).

If you get a fresh page from the page allocator then you know how many objects are available in a slab page.

There is also a counter in each slab page for the objects allocated. The number of free objects is page->objects - page->inuse.

This is only true for a locked cmpxchg. The unlocked cmpxchg used for the per cpu freelist does not use the counters in the page struct.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753792AbbHXA6x (ORCPT ); Sun, 23 Aug 2015 20:58:53 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38294 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753725AbbHXA6w (ORCPT ); Sun, 23 Aug 2015 20:58:52 -0400
Subject: [PATCH V2 1/3] slub: extend slowpath __slab_free() to handle bulk free
From: Jesper Dangaard Brouer
To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:58:48 +0200
Message-ID: <20150824005823.2947.19259.stgit@localhost>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
User-Agent: StGit/0.17-dirty
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Make it possible to free a freelist with several objects by extending __slab_free() with two arguments: a freelist_head pointer and objects counter (cnt). If freelist_head pointer is set, then the object must be the freelist tail pointer. This allows a freelist with several objects (all within the same slab-page) to be free'ed using a single locked cmpxchg_double. Micro benchmarking showed no performance reduction due to this change.
Signed-off-by: Jesper Dangaard Brouer
---
V2: Per request of Christoph Lameter
 * Made it more clear that freelist objs must exist within same page

 mm/slub.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c9305f525004..10b57a3bb895 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2573,9 +2573,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * So we still attempt to reduce cache line usage. Just take the slab
  * lock and free the item. If there is no additional partial page
  * handling required then we can return immediately.
+ *
+ * Bulk free of a freelist with several objects (all pointing to the
+ * same page) possible by specifying freelist_head ptr and object as
+ * tail ptr, plus objects count (cnt).
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *x, unsigned long addr,
+			void *freelist_head, int cnt)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -2584,6 +2589,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long uninitialized_var(flags);
+	void *new_freelist = (!freelist_head) ? object : freelist_head;
 
 	stat(s, FREE_SLOWPATH);
 
@@ -2601,7 +2607,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		set_freepointer(s, object, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2632,7 +2638,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		new_freelist, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2736,7 +2742,7 @@ redo:
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, x, addr, NULL, 1);
 
 }
 
@@ -2780,7 +2786,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, _RET_IP_, NULL, 1);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932157AbbHXA7I (ORCPT ); Sun, 23 Aug 2015 20:59:08 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38322 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932066AbbHXA7G (ORCPT ); Sun, 23 Aug 2015 20:59:06 -0400
Subject: [PATCH V2 2/3] slub: optimize bulk slowpath free by detached freelist
From: Jesper Dangaard Brouer
To: linux-mm@kvack.org, Christoph Lameter , akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E.
McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:59:04 +0200
Message-ID: <20150824005857.2947.51229.stgit@localhost>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
User-Agent: StGit/0.17-dirty
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This change focuses on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk.

The slowpath call __slab_free() has been extended with support for bulk free, which amortizes the overhead of the locked cmpxchg_double_slab.

To use the new bulking feature of __slab_free(), we build what I call a detached freelist. The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same page, each with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects, without any synchronization needed. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs.

This implementation is fairly simple, as it only builds the detached freelist if two consecutive objects belong to the same page. When detecting that the object page does not match, it simply flushes the local freelist, and starts a new local detached freelist. It will not look ahead to see if further opportunities exist in the array.

The next patch has a more advanced look-ahead approach, but is also more complicated.
Splitting them up, because I want to be able to benchmark the simple against the advanced approach. Signed-off-by: Jesper Dangaard Brouer --- bulk- Fallback - Bulk API 1 - 64 cycles(tsc) 16.109 ns - 47 cycles(tsc) 11.894 - improved 26.6% 2 - 56 cycles(tsc) 14.158 ns - 45 cycles(tsc) 11.274 - improved 19.6% 3 - 54 cycles(tsc) 13.650 ns - 23 cycles(tsc) 6.001 - improved 57.4% 4 - 53 cycles(tsc) 13.268 ns - 21 cycles(tsc) 5.262 - improved 60.4% 8 - 51 cycles(tsc) 12.841 ns - 18 cycles(tsc) 4.718 - improved 64.7% 16 - 50 cycles(tsc) 12.583 ns - 19 cycles(tsc) 4.896 - improved 62.0% 30 - 85 cycles(tsc) 21.357 ns - 26 cycles(tsc) 6.549 - improved 69.4% 32 - 82 cycles(tsc) 20.690 ns - 25 cycles(tsc) 6.412 - improved 69.5% 34 - 81 cycles(tsc) 20.322 ns - 25 cycles(tsc) 6.365 - improved 69.1% 48 - 93 cycles(tsc) 23.332 ns - 28 cycles(tsc) 7.139 - improved 69.9% 64 - 98 cycles(tsc) 24.544 ns - 62 cycles(tsc) 15.543 - improved 36.7% 128 - 96 cycles(tsc) 24.219 ns - 68 cycles(tsc) 17.143 - improved 29.2% 158 - 107 cycles(tsc) 26.817 ns - 69 cycles(tsc) 17.431 - improved 35.5% 250 - 107 cycles(tsc) 26.824 ns - 70 cycles(tsc) 17.730 - improved 34.6% --- mm/slub.c | 48 +++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 10b57a3bb895..40e4b5926311 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2756,12 +2756,26 @@ void kmem_cache_free(struct kmem_cache *s, void *x) } EXPORT_SYMBOL(kmem_cache_free); +struct detached_freelist { + struct page *page; + void *freelist; + void *tail_object; + int cnt; +}; + /* Note that interrupts must be enabled when calling this function. */ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) { struct kmem_cache_cpu *c; struct page *page; int i; + /* Opportunistically delay updating page->freelist, hoping + * next free happen to same page. Start building the freelist + * in the page, but keep local stack ptr to freelist. 
If + * successful several object can be transferred to page with a + * single cmpxchg_double. + */ + struct detached_freelist df = {0}; local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); @@ -2778,22 +2792,42 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) page = virt_to_head_page(object); - if (c->page == page) { + if (page == df.page) { + /* Oppotunity to delay real free */ + set_freepointer(s, object, df.freelist); + df.freelist = object; + df.cnt++; + } else if (c->page == page) { /* Fastpath: local CPU free */ set_freepointer(s, object, c->freelist); c->freelist = object; } else { - c->tid = next_tid(c->tid); - local_irq_enable(); - /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_, NULL, 1); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Slowpath: Flush delayed free */ + if (df.page) { + c->tid = next_tid(c->tid); + local_irq_enable(); + __slab_free(s, df.page, df.tail_object, + _RET_IP_, df.freelist, df.cnt); + local_irq_disable(); + c = this_cpu_ptr(s->cpu_slab); + } + /* Start new round of delayed free */ + df.page = page; + df.tail_object = object; + set_freepointer(s, object, NULL); + df.freelist = object; + df.cnt = 1; } } exit: c->tid = next_tid(c->tid); local_irq_enable(); + + /* Flush detached freelist */ + if (df.page) { + __slab_free(s, df.page, df.tail_object, + _RET_IP_, df.freelist, df.cnt); + } } EXPORT_SYMBOL(kmem_cache_free_bulk); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932192AbbHXA7b (ORCPT ); Sun, 23 Aug 2015 20:59:31 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38369 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932066AbbHXA7a (ORCPT ); Sun, 23 Aug 2015 20:59:30 -0400 Subject: [PATCH V2 3/3] slub: build detached freelist with look-ahead From: Jesper Dangaard Brouer To: linux-mm@kvack.org, Christoph Lameter , 
akpm@linux-foundation.org
Cc: aravinda@linux.vnet.ibm.com, iamjoonsoo.kim@lge.com, "Paul E. McKenney" , linux-kernel@vger.kernel.org, Jesper Dangaard Brouer
Date: Mon, 24 Aug 2015 02:59:27 +0200
Message-ID: <20150824005911.2947.50857.stgit@localhost>
In-Reply-To: <20150824005727.2947.36065.stgit@localhost>
References: <20150824005727.2947.36065.stgit@localhost>
User-Agent: StGit/0.17-dirty
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This change is a more advanced use of detached freelist. The bulk free array is scanned in a progressive manner with a limited look-ahead facility.

To maintain the same performance level as the previous simple implementation, the look-ahead has been limited to only 3 objects. This number has been determined by experimental micro benchmarking.

For performance, the free loop in kmem_cache_free_bulk has been significantly reorganized, with a focus on making the branches more predictable for the compiler. E.g. the per CPU c->freelist is also built as a detached freelist, even though it would be just as fast as freeing directly to it, but it saves creating an unpredictable branch.

Another benefit of this change is that kmem_cache_free_bulk() runs mostly with IRQs enabled. The local IRQs are only disabled when updating the per CPU c->freelist. This should please Thomas Gleixner.

Pitfall(1): Removed kmem debug support.

Pitfall(2): No BUG_ON() freeing NULL pointers, but the algorithm handles and skips these NULL pointers.

Compare against previous patch: There is some fluctuation in the benchmarks between runs. To counter this I've run some specific[1] bulk sizes, repeated 100 times and run dmesg through Rusty's "stats"[2] tool.
Command line: sudo dmesg -c ;\ for x in `seq 100`; do \ modprobe slab_bulk_test02 bulksz=48 loops=100000 && rmmod slab_bulk_test02; \ echo $x; \ sleep 0.${RANDOM} ;\ done; \ dmesg | stats Results: bulk size:16, average: +2.01 cycles Prev: between 19-52 (average: 22.65 stddev:+/-6.9) This: between 19-67 (average: 24.67 stddev:+/-9.9) bulk size:48, average: +1.54 cycles Prev: between 23-45 (average: 27.88 stddev:+/-4) This: between 24-41 (average: 29.42 stddev:+/-3.7) bulk size:144, average: +1.73 cycles Prev: between 44-76 (average: 60.31 stddev:+/-7.7) This: between 49-80 (average: 62.04 stddev:+/-7.3) bulk size:512, average: +8.94 cycles Prev: between 50-68 (average: 60.11 stddev: +/-4.3) This: between 56-80 (average: 69.05 stddev: +/-5.2) bulk size:2048, average: +26.81 cycles Prev: between 61-73 (average: 68.10 stddev:+/-2.9) This: between 90-104(average: 94.91 stddev:+/-2.1) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test02.c [2] https://github.com/rustyrussell/stats Signed-off-by: Jesper Dangaard Brouer --- bulk- Fallback - Bulk API 1 - 64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6% 2 - 57 cycles(tsc) 14.397 ns - 29 cycles(tsc) 7.368 - improved 49.1% 3 - 55 cycles(tsc) 13.797 ns - 24 cycles(tsc) 6.003 - improved 56.4% 4 - 53 cycles(tsc) 13.500 ns - 22 cycles(tsc) 5.543 - improved 58.5% 8 - 52 cycles(tsc) 13.008 ns - 20 cycles(tsc) 5.047 - improved 61.5% 16 - 51 cycles(tsc) 12.763 ns - 20 cycles(tsc) 5.015 - improved 60.8% 30 - 50 cycles(tsc) 12.743 ns - 20 cycles(tsc) 5.062 - improved 60.0% 32 - 51 cycles(tsc) 12.908 ns - 20 cycles(tsc) 5.089 - improved 60.8% 34 - 87 cycles(tsc) 21.936 ns - 28 cycles(tsc) 7.006 - improved 67.8% 48 - 79 cycles(tsc) 19.840 ns - 31 cycles(tsc) 7.755 - improved 60.8% 64 - 86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9% 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7% 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - 
improved 34.8% 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6% --- mm/slub.c | 138 ++++++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 90 insertions(+), 48 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 40e4b5926311..49ae96f45670 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2763,71 +2763,113 @@ struct detached_freelist { int cnt; }; -/* Note that interrupts must be enabled when calling this function. */ -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +/* + * This function extract objects belonging to the same page, and + * builds a detached freelist directly within the given page/objects. + * This can happen without any need for synchronization, because the + * objects are owned by running process. The freelist is build up as + * a single linked list in the objects. The idea is, that this + * detached freelist can then be bulk transferred to the real + * freelist(s), but only requiring a single synchronization primitive. + */ +static inline int build_detached_freelist( + struct kmem_cache *s, size_t size, void **p, + struct detached_freelist *df, int start_index) { - struct kmem_cache_cpu *c; struct page *page; int i; - /* Opportunistically delay updating page->freelist, hoping - * next free happen to same page. Start building the freelist - * in the page, but keep local stack ptr to freelist. If - * successful several object can be transferred to page with a - * single cmpxchg_double. 
- */ - struct detached_freelist df = {0}; + int lookahead = 0; + void *object; - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Always re-init detached_freelist */ + do { + object = p[start_index]; + if (object) { + /* Start new delayed freelist */ + df->page = virt_to_head_page(object); + df->tail_object = object; + set_freepointer(s, object, NULL); + df->freelist = object; + df->cnt = 1; + p[start_index] = NULL; /* mark object processed */ + } else { + df->page = NULL; /* Handle NULL ptr in array */ + } + start_index++; + } while (!object && start_index < size); - for (i = 0; i < size; i++) { - void *object = p[i]; + for (i = start_index; i < size; i++) { + object = p[i]; - BUG_ON(!object); - /* kmem cache debug support */ - s = cache_from_obj(s, object); - if (unlikely(!s)) - goto exit; - slab_free_hook(s, object); + if (!object) + continue; /* Skip processed objects */ page = virt_to_head_page(object); - if (page == df.page) { - /* Oppotunity to delay real free */ - set_freepointer(s, object, df.freelist); - df.freelist = object; - df.cnt++; - } else if (c->page == page) { - /* Fastpath: local CPU free */ - set_freepointer(s, object, c->freelist); - c->freelist = object; + /* df->page is always set at this point */ + if (page == df->page) { + /* Oppotunity build freelist */ + set_freepointer(s, object, df->freelist); + df->freelist = object; + df->cnt++; + p[i] = NULL; /* mark object processed */ + if (!lookahead) + start_index++; } else { - /* Slowpath: Flush delayed free */ - if (df.page) { + /* Limit look ahead search */ + if (++lookahead >= 3) + return start_index; + continue; + } + } + return start_index; +} + +/* Note that interrupts must be enabled when calling this function. 
*/ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + struct kmem_cache_cpu *c; + int iterator = 0; + struct detached_freelist df; + + BUG_ON(!size); + + /* Per CPU ptr may change afterwards */ + c = this_cpu_ptr(s->cpu_slab); + + while (likely(iterator < size)) { + iterator = build_detached_freelist(s, size, p, &df, iterator); + if (likely(df.page)) { + redo: + if (c->page == df.page) { + /* + * Local CPU free require disabling + * IRQs. It is possible to miss the + * oppotunity and instead free to + * page->freelist, but it does not + * matter as page->freelist will + * eventually be transferred to + * c->freelist + */ + local_irq_disable(); + c = this_cpu_ptr(s->cpu_slab); /* reload */ + if (c->page != df.page) { + local_irq_enable(); + goto redo; + } + /* Bulk transfer to CPU c->freelist */ + set_freepointer(s, df.tail_object, c->freelist); + c->freelist = df.freelist; + c->tid = next_tid(c->tid); local_irq_enable(); + } else { + /* Bulk transfer to page->freelist */ __slab_free(s, df.page, df.tail_object, _RET_IP_, df.freelist, df.cnt); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); } - /* Start new round of delayed free */ - df.page = page; - df.tail_object = object; - set_freepointer(s, object, NULL); - df.freelist = object; - df.cnt = 1; } } -exit: - c->tid = next_tid(c->tid); - local_irq_enable(); - - /* Flush detached freelist */ - if (df.page) { - __slab_free(s, df.page, df.tail_object, - _RET_IP_, df.freelist, df.cnt); - } } EXPORT_SYMBOL(kmem_cache_free_bulk); From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. 
Date: Fri, 04 Sep 2015 19:00:34 +0200 Message-ID: <20150904165944.4312.32435.stgit@devil> References: <20150824005727.2947.36065.stgit@localhost> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com To: netdev@vger.kernel.org, akpm@linux-foundation.org Return-path: Received: from mx1.redhat.com ([209.132.183.28]:45062 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760193AbbIDRAh (ORCPT ); Fri, 4 Sep 2015 13:00:37 -0400 In-Reply-To: <20150824005727.2947.36065.stgit@localhost> Sender: netdev-owner@vger.kernel.org List-ID: During TX DMA completion cleanup there exist an opportunity in the NIC drivers to perform bulk free, without introducing additional latency. For an IPv4 forwarding workload the network stack is hitting the slowpath of the kmem_cache "slub" allocator. This slowpath can be mitigated by bulk free via the detached freelists patchset. 
Depend on patchset:
 http://thread.gmane.org/gmane.linux.kernel.mm/137469

Kernel based on MMOTM tag 2015-08-24-16-12 from git repo:
 git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation"

Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen):
 * Before: 2043575 pps
 * After : 2090522 pps
 * Improvements: +46947 pps and -10.99 ns

In the before case, perf report shows slub free hits the slowpath:
 1.98%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 1.29%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_free
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_alloc
 0.20%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.17%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.09%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69

In the after case, the slowpath calls are almost gone:
 0.22%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.18%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.08%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69

Extra info, tuning SLUB per CPU structures gives further improvements:
 * slub-tuned: 2124217 pps
 * patched increase: +33695 pps and  -7.59 ns
 * before  increase: +80642 pps and -18.58 ns

Tuning done:
 echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial
 echo 9   > /sys/kernel/slab/skbuff_head_cache/min_partial

Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge":
 * slab_nomerge: 2121824 pps

Test notes:
 * Note the very fast CPU: i7-4790K CPU @ 4.00GHz
 * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
 * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP
 * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh)
 * Tuned for forwarding:
  - unloaded netfilter modules
  - Sysctl settings:
   - net/ipv4/conf/default/rp_filter = 0
   - net/ipv4/conf/all/rp_filter = 0
   - (Forwarding performance is affected by early demux)
   - net/ipv4/ip_early_demux = 0
   - net.ipv4.ip_forward = 1
  - Disabled GRO on NICs
   - ethtool -K ixgbe3 gro off tso off gso off

---

Jesper Dangaard Brouer (3):
      net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
      net: NIC helper API for building array of skbs to free
      ixgbe: bulk free SKBs during TX completion cleanup cycle


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   13 +++-
 include/linux/netdevice.h                     |   62 ++++++++++++++++++
 include/linux/skbuff.h                        |    1
 net/core/skbuff.c                             |   87 ++++++++++++++++++++-----
 4 files changed, 144 insertions(+), 19 deletions(-)

--

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free
Date: Fri, 04 Sep 2015 19:01:06 +0200
Message-ID: <20150904170104.4312.47707.stgit@devil>
References: <20150904165944.4312.32435.stgit@devil>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Cc: linux-mm@kvack.org, Jesper Dangaard Brouer , aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com
To: netdev@vger.kernel.org, akpm@linux-foundation.org
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:58923 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760473AbbIDRBI (ORCPT ); Fri, 4 Sep 2015 13:01:08 -0400
In-Reply-To: <20150904165944.4312.32435.stgit@devil>
Sender: netdev-owner@vger.kernel.org
List-ID:

The NIC device drivers are expected to use this small helper API,
when building up an array of objects/skbs to bulk free, while
(loop) processing objects to free.  Objects to be freed later are
added (dev_free_waitlist_add) to an array and flushed if the array
runs full.  After processing, the array is flushed (dev_free_waitlist_flush).
The array should be stored on the local stack.

Usage e.g.
during TX completion loop the NIC driver can replace
dev_consume_skb_any() with an "add" and after the loop a "flush".

For performance reasons the compiler should inline most of these
functions.

Signed-off-by: Jesper Dangaard Brouer
---
 include/linux/netdevice.h |   62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 05b9a694e213..d0133e778314 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2935,6 +2935,68 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 	__dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
 }
 
+/* The NIC device drivers are expected to use this small helper API,
+ * when building up an array of objects/skbs to bulk free, while
+ * (loop) processing objects to free.  Objects to be freed later are
+ * added (dev_free_waitlist_add) to an array and flushed if the array
+ * runs full.  After processing, the array is flushed (dev_free_waitlist_flush).
+ * The array should be stored on the local stack.
+ *
+ * Usage e.g. during TX completion loop the NIC driver can replace
+ * dev_consume_skb_any() with an "add" and after the loop a "flush".
+ *
+ * For performance reasons the compiler should inline most of these
+ * functions.
+ */
+struct dev_free_waitlist {
+	struct sk_buff **skbs;
+	unsigned int skb_cnt;
+};
+
+static void __dev_free_waitlist_bulkfree(struct dev_free_waitlist *wl)
+{
+	/* Cannot bulk free from interrupt context or with IRQs
+	 * disabled, due to how the SLAB bulk API works (and gains its
+	 * speedup).  This can e.g. happen due to invocation from
+	 * netconsole/netpoll.
+	 */
+	if (unlikely(in_irq() || irqs_disabled())) {
+		int i;
+
+		for (i = 0; i < wl->skb_cnt; i++)
+			dev_consume_skb_irq(wl->skbs[i]);
+	} else {
+		/* Likely fastpath, don't call with cnt == 0 */
+		kfree_skb_bulk(wl->skbs, wl->skb_cnt);
+	}
+}
+
+static inline void dev_free_waitlist_flush(struct dev_free_waitlist *wl)
+{
+	/* Flush the waitlist, but only if any objects remain, as bulk
+	 * freeing "zero" objects is not supported, and it also avoids
+	 * pointless function calls.
+	 */
+	if (likely(wl->skb_cnt))
+		__dev_free_waitlist_bulkfree(wl);
+}
+
+static __always_inline void dev_free_waitlist_add(struct dev_free_waitlist *wl,
+						  struct sk_buff *skb,
+						  unsigned int max)
+{
+	/* It is recommended that max is a builtin constant, as this
+	 * saves one register when inlined. Catch offenders with:
+	 * BUILD_BUG_ON(!__builtin_constant_p(max));
+	 */
+	wl->skbs[wl->skb_cnt++] = skb;
+	if (wl->skb_cnt == max) {
+		/* Detect when waitlist array is full, then flush and reset */
+		__dev_free_waitlist_bulkfree(wl);
+		wl->skb_cnt = 0;
+	}
+}
+
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb);

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
Date: Mon, 7 Sep 2015 10:41:01 +0200
Message-ID: <20150907104101.3e392a6d@redhat.com>
References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E.
McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com To: Tom Herbert Return-path: Received: from mx1.redhat.com ([209.132.183.28]:51875 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752543AbbIGIlJ (ORCPT ); Mon, 7 Sep 2015 04:41:09 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Fri, 4 Sep 2015 11:47:17 -0700 Tom Herbert wrote: > On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer wrote: > > Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(), > > in the network stack in form of function kfree_skb_bulk() which bulk > > free SKBs (not skb clones or skb->head, yet). > > [...] > > +/** > > + * kfree_skb_bulk - bulk free SKBs when refcnt allows to > > + * @skbs: array of SKBs to free > > + * @size: number of SKBs in array > > + * > > + * If SKB refcnt allows for free, then release any auxiliary data > > + * and then bulk free SKBs to the SLAB allocator. > > + * > > + * Note that interrupts must be enabled when calling this function. > > + */ > > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size) > > +{ > > What not pass a list of skbs (e.g. using skb->next)? Because the next layer, the slab API needs an array: kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) Look at the patch: [PATCH V2 3/3] slub: build detached freelist with look-ahead http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472 Where I use this array to progressively scan for objects belonging to the same page. (A subtle detail is I manage to zero out the array, which is good from a security/error-handling point of view, as pointers to the objects are not left dangling on the stack). I cannot argue that, writing skb->next comes as an additional cost, because the slUb free also writes into this cacheline. Perhaps the slAb allocator does not? [...] 
> > + if (likely(cnt)) { > > + kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs); > > + } > > +} > > +EXPORT_SYMBOL(kfree_skb_bulk); -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Date: Mon, 7 Sep 2015 22:14:48 +0200 Message-ID: <20150907221448.2b18b174@redhat.com> References: <20150904165944.4312.32435.stgit@devil> <20150904170046.4312.38018.stgit@devil> <20150907104101.3e392a6d@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: Linux Kernel Network Developers , akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E. McKenney" , iamjoonsoo.kim@lge.com, brouer@redhat.com To: Tom Herbert Return-path: Received: from mx1.redhat.com ([209.132.183.28]:38828 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751091AbbIGUO4 (ORCPT ); Mon, 7 Sep 2015 16:14:56 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 7 Sep 2015 09:25:49 -0700 Tom Herbert wrote: > >> What not pass a list of skbs (e.g. using skb->next)? > > > > Because the next layer, the slab API needs an array: > > kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > > > > I suppose we could ask the same question of that function. IMO > encouraging drivers to define arrays of pointers on the stack like > you're doing in the ixgbe patch is a bad direction. > > In any case I believe this would be simpler in the networking side > just to maintain a list of skb's to free. Then the dev_free_waitlist > structure might not be needed then since we could just use a > skb_buf_head for that. I guess it is more natural for the network side to work with skb lists. 
But I'm keeping the array for slab/slub, as we cannot assume/enforce
objects of a specific data type there.

I was worried about how large a bulk free we should allow, due to the
interaction with skb->destructor, which for sockets affects their memory
accounting.  E.g. we have seen issues with hypervisor network drivers
(Xen and HyperV) that are so slow to clean up their TX completion queue
that their TCP bandwidth gets limited by tcp_limit_output_bytes.

I capped it at 32, and the NAPI budget will cap it at 64.

By the following argument, bulk free of 64 objects/skb's is not a
problem.  The delay I'm introducing is very small, before the first
real kfree_skb is called, which calls the destructor which frees up
socket memory accounting.

Assume measured packet rate of: 2105011 pps
Time between packets (1/2105011*10^9): 475 ns

Perf shows ixgbe_clean_tx_irq() takes: 1.23%
Extrapolating the function call cost: 5.84 ns (475*(1.23/100))

Processing 64 packets in ixgbe_clean_tx_irq(): 373 ns.
At 10Gbit/s how many bytes can arrive in this period, only: 466 bytes.
((373/10^9)*(10000*10^6)/8)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Duyck
Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
Date: Mon, 7 Sep 2015 14:23:39 -0700
Message-ID: <55EE005B.9080802@gmail.com>
References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <20150907101610.44504597@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, Christoph Lameter , "Paul E.
McKenney" , iamjoonsoo.kim@lge.com To: Jesper Dangaard Brouer Return-path: Received: from mail-pa0-f50.google.com ([209.85.220.50]:33652 "EHLO mail-pa0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750897AbbIGVXl (ORCPT ); Mon, 7 Sep 2015 17:23:41 -0400 Received: by pacex6 with SMTP id ex6so104771517pac.0 for ; Mon, 07 Sep 2015 14:23:41 -0700 (PDT) In-Reply-To: <20150907101610.44504597@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On 09/07/2015 01:16 AM, Jesper Dangaard Brouer wrote: > On Fri, 4 Sep 2015 11:09:21 -0700 > Alexander Duyck wrote: > >> This is an interesting start. However I feel like it might work better >> if you were to create a per-cpu pool for skbs that could be freed and >> allocated in NAPI context. So for example we already have >> napi_alloc_skb, why not just add a napi_free_skb > I do like the idea... If nothing else you want to avoid having to redo this code for every driver. If you can just replace dev_kfree_skb with some other freeing call it will make it much easier to convert other drivers. >> and then make the array >> of objects to be freed part of a pool that could be used for either >> allocation or freeing? If the pool runs empty you just allocate >> something like 8 or 16 new skb heads, and if you fill it you just free >> half of the list? > But I worry that this algorithm will "randomize" the (skb) objects. > And the SLUB bulk optimization only works if we have many objects > belonging to the same page. Agreed to some extent, however at the same time what this does is allow for a certain amount of skb recycling. So instead of freeing the buffers received from the socket you would likely be recycling them and sending them back as Rx skbs. In the case of a heavy routing workload you would likely just be cycling through the same set of buffers and cleaning them off of transmit and placing them back on receive. 
The general idea is to keep the memory footprint small, so recycling Tx
buffers to use for Rx can have its advantages in terms of keeping
things confined to the limits of the L1/L2 cache.

> It would likely be fastest to implement a simple stack (for these
> per-cpu pools), but I again worry that it would randomize the
> object-pages. A simple queue might be better, but slightly slower.
> Guess I could just reuse part of qmempool / alf_queue as a quick test.

I would say don't over engineer it.  A stack is the simplest.  The
qmempool / alf_queue is just going to add extra overhead.

The added advantage of the stack is that you are working with pointers
and you are guaranteed that the list of pointers is going to be
linear.  If you use a queue, clean-up will require up to 2 blocks of
freeing in case the ring has wrapped.

> Having a per-cpu pool in networking would solve the problem of the slub
> per-cpu pool isn't large enough for our use-case. On the other hand,
> maybe we should fix slub to dynamically adjust the size of it's per-cpu
> resources?

The per-cpu pool is just meant to replace the per-driver pool you
were using.  By using a per-cpu pool you would get better aggregation
and can just flush the freed buffers at the end of the Rx softirq or
when the pool is full, instead of having to flush smaller lists per
call to napi->poll.

> A pre-req knowledge (for people not knowing slub's internal details):
> Slub alloc path will pickup a page, and empty all objects for that page
> before proceeding to the next page. Thus, slub bulk alloc will give
> many objects belonging to the page. I'm trying to keep these objects
> grouped together until they can be free'ed in a bulk.

The problem is you aren't going to be able to keep them together very
easily.  Yes, they might be allocated all from one spot on Rx but they
can very easily end up scattered to multiple locations.  The same
applies to Tx where you will have multiple flows all outgoing on one
port.
That is why I was thinking adding some skb recycling via a per-cpu stack might be useful especially since you have to either fill or empty the stack when you allocate or free multiple skbs anyway. In addition it provides an easy way for a bulk alloc and a bulk free to share data structures without adding additional overhead by keeping them separate. If you managed it with some sort of high-water/low-water mark type setup you could very well keep the bulk-alloc/free busy without too much fragmentation. For the socket transmit/receive case the thing you have to keep in mind is that if you reuse the buffers you are just going to be throwing them back at the sockets which are likely not using bulk-free anyway. So in that case reuse could actually improve things by simply reducing the number of calls to bulk-alloc you will need to make since things like TSO allow you to send 64K using a single sk_buff, while you will be likely be receiving one or more acks on the receive side which will require allocations. - Alex From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. Date: Tue, 8 Sep 2015 12:32:40 -0500 (CDT) Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> <20150905131825.6c04837d@redhat.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. 
McKenney" , iamjoonsoo.kim@lge.com To: Jesper Dangaard Brouer Return-path: Received: from resqmta-ch2-03v.sys.comcast.net ([69.252.207.35]:43580 "EHLO resqmta-ch2-03v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752709AbbIHRcl (ORCPT ); Tue, 8 Sep 2015 13:32:41 -0400 In-Reply-To: <20150905131825.6c04837d@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote: > The double_cmpxchg without lock prefix still cost 9 cycles, which is > very fast but still a cost (add approx 19 cycles for a lock prefix). > > It is slower than local_irq_disable + local_irq_enable that only cost > 7 cycles, which the bulking call uses. (That is the reason bulk calls > with 1 object can almost compete with fastpath). Hmmm... Guess we need to come up with distinct version of kmalloc() for irq and non irq contexts to take advantage of that . Most at non irq context anyways. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API. Date: Wed, 9 Sep 2015 09:08:47 -0500 (CDT) Message-ID: References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil> <55E9DE51.7090109@gmail.com> <55EA0172.2040505@gmail.com> <20150905131825.6c04837d@redhat.com> <20150909145919.4d68ea36@redhat.com> Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Alexander Duyck , netdev@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org, aravinda@linux.vnet.ibm.com, "Paul E. 
McKenney" , iamjoonsoo.kim@lge.com To: Jesper Dangaard Brouer Return-path: Received: from resqmta-ch2-08v.sys.comcast.net ([69.252.207.40]:53727 "EHLO resqmta-ch2-08v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751469AbbIIOIu (ORCPT ); Wed, 9 Sep 2015 10:08:50 -0400 In-Reply-To: <20150909145919.4d68ea36@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, 9 Sep 2015, Jesper Dangaard Brouer wrote: > > Hmmm... Guess we need to come up with distinct version of kmalloc() for > > irq and non irq contexts to take advantage of that . Most at non irq > > context anyways. > > I agree, it would be an easy win. Do notice this will have the most > impact for the slAb allocator. > > I estimate alloc + free cost would save: > * slAb would save approx 60 cycles > * slUb would save approx 4 cycles > > We might consider keeping the slUb approach as it would be more > friendly for RT with less IRQ disabling. IRQ disabling it a mixed bag. Older cpus have higher latencies there and also virtualized contexts may require the hypervisor tracks the interrupt state. For recent intel cpus this is certainly a workable approach. 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Experiences with slub bulk use-case for network stack
Date: Wed, 16 Sep 2015 12:02:30 +0200
Message-ID: <20150916120230.4ca75217@redhat.com>
References: <20150824005727.2947.36065.stgit@localhost> <20150904165944.4312.32435.stgit@devil>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org, akpm@linux-foundation.org, Alexander Duyck , iamjoonsoo.kim@lge.com
To: linux-mm@kvack.org, Christoph Lameter
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:57528 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752135AbbIPKCf (ORCPT ); Wed, 16 Sep 2015 06:02:35 -0400
In-Reply-To: <20150904165944.4312.32435.stgit@devil>
Sender: netdev-owner@vger.kernel.org
List-ID:

Hint, this leads up to discussing if the current bulk *ALLOC* API needs
to be changed...

Alex and I have been working hard on a practical use-case for SLAB
bulking (mostly slUb), in the network stack.  Here is a summary of
what we have learned so far.

Bulk free'ing SKBs during TX completion is a big and easy win.

Specifically for slUb, the normal path for freeing these objects (which
are not on c->freelist) requires a locked double_cmpxchg per object.
The bulk free (via the detached freelist patch) allows all objects
belonging to the same slab-page to be freed with a single locked
double_cmpxchg.  Thus, the bulk free speedup is quite an improvement.

The slUb alloc is hard to beat on speed:
 * accessing c->freelist, local cmpxchg 9 cycles (38% of cost)
 * c->freelist is refilled with single locked cmpxchg

In micro benchmarking it looks like we can beat alloc, because we do a
local_irq_{disable,enable} (cost 7 cycles) and then pull out all
objects in c->freelist.  Thus, saving 9 cycles per object (counting
from the 2nd object).

However, in practical use-cases we are seeing the single object alloc
win over bulk alloc; we believe this to be due to prefetching.  When
c->freelist gets (semi) cache-cold, it gets more expensive to walk
the freelist (which is a basic singly linked list to the next free
object).

For bulk alloc the full freelist is walked (right away) and objects
are pulled out into the array.  For normal single object alloc only a
single object is returned, but it does a prefetch on the next object
pointer.  Thus, next time single alloc is called the object will have
been prefetched.  Doing prefetch in bulk alloc only helps a little, as
it does not have enough "time" between accessing/walking the freelist
for objects.

So, how can we solve this and make bulk alloc faster?

Alex and I had the idea that bulk alloc returns an "allocator specific
cache" data-structure (and we add some helpers to access this).

In the slUb case, the freelist is a singly linked pointer list.  In the
network stack the skb objects have a skb->next pointer, which is
located at the same position as the freelist pointer.  Thus, simply
returning the freelist directly could be interpreted as a skb-list.
The helper API would then do the prefetching, when pulling out
objects.

For the slUb case, we would simply cmpxchg either c->freelist or
page->freelist with a NULL ptr, and then own all objects on the
freelist.  This also reduces the time we keep IRQs disabled.

API wise, we don't (necessarily) know how many objects are on the
freelist (without first walking the list, which would cause stalls on
data, which we are trying to avoid).

Thus, the API of always returning the exact number of requested
objects will not work...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(related to http://thread.gmane.org/gmane.linux.kernel.mm/137469)
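
To make the "return the freelist and let a helper prefetch" idea a bit
more concrete, here is a rough userspace sketch.  It is not a proposed
kernel API: every toy_* name is hypothetical, a C11 atomic_exchange
stands in for the cmpxchg against c->freelist or page->freelist, and
__builtin_prefetch (a GCC/Clang builtin) stands in for the kernel's
prefetch helper.  The object's first member plays the role of both the
freelist pointer and skb->next.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object; "next" doubles as the allocator's freelist pointer */
struct toy_obj {
	struct toy_obj *next;
	int payload;
};

/* Hypothetical per-cpu cache holding only a freelist head */
struct toy_cache {
	_Atomic(struct toy_obj *) freelist;
};

/*
 * "Bulk alloc" by taking ownership of the whole freelist in one atomic
 * step.  The caller does not learn how many objects it got; finding
 * out would mean walking the list, which is what we try to avoid.
 */
static struct toy_obj *toy_bulk_detach(struct toy_cache *c)
{
	return atomic_exchange(&c->freelist, (struct toy_obj *)NULL);
}

/*
 * Helper for pulling one object off a detached list.  It prefetches
 * the following object, so its memory has time to arrive before the
 * caller comes back for it.
 */
static struct toy_obj *toy_list_pop(struct toy_obj **list)
{
	struct toy_obj *obj = *list;

	if (!obj)
		return NULL;
	*list = obj->next;
	if (*list)
		__builtin_prefetch(*list);
	obj->next = NULL;
	return obj;
}

/* Usage: detach a three object freelist and consume it one by one */
static int toy_handoff_demo(void)
{
	struct toy_obj objs[3] = {
		{ &objs[1], 10 }, { &objs[2], 20 }, { NULL, 30 },
	};
	struct toy_cache cache;
	struct toy_obj *list, *obj;
	int sum = 0;

	atomic_init(&cache.freelist, &objs[0]);

	list = toy_bulk_detach(&cache);
	/* the cache is now empty, a second detach yields nothing */
	assert(toy_bulk_detach(&cache) == NULL);

	while ((obj = toy_list_pop(&list)) != NULL)
		sum += obj->payload;	/* caller's per-object work */
	return sum;
}
```

The sketch also shows the API consequence mentioned above: the caller
gets "however many objects were on the list" and has to loop until the
helper returns NULL, rather than asking for an exact count up front.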