From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f54.google.com (mail-qg0-f54.google.com [209.85.192.54]) by kanga.kvack.org (Postfix) with ESMTP id 8BFA66B0038 for ; Mon, 28 Sep 2015 08:26:08 -0400 (EDT) Received: by qgez77 with SMTP id z77so118864341qge.1 for ; Mon, 28 Sep 2015 05:26:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 141si15198894qhx.1.2015.09.28.05.26.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:07 -0700 (PDT) Subject: [PATCH 0/7] Further optimizing SLAB/SLUB bulking From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:04 +0200 Message-ID: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim Most important part of this patchset is the introducing of what I call detached freelist, for improving SLUB performance of object freeing in the "slowpath" of kmem_cache_free_bulk. Previous patchset V2 thread: http://thread.gmane.org/gmane.linux.kernel.mm/137469 Not using V3 tag as patch titles have changed and I've merged some patches. This was joint work with Alexander Duyck while still at Red Hat. Notes for patches: * First two patches (from Christoph) are already in AKPM MMOTS. * Patch 3 is trivial * Patch 4 is a repost, implements bulking for SLAB. - http://thread.gmane.org/gmane.linux.kernel.mm/138220 * Patch 5 and 6 are the important patches - Patch 5 handle "freelists" in slab_free() and __slab_free(). - Patch 6 intro detached freelists, and significant performance improvement Patches should be ready for the MM-tree, as I'm now handling kmem debug support. Based on top of commit 519f526d39 in net-next, but I've tested it applies on top of mmotm-2015-09-18-16-08. The benchmarking tools are avail here: https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm See: slab_bulk_test0{1,2,3}.c This patchset is part of my network stack use-case. I'll post the network side of the patchset as soon as I've cleaned it up, rebased it on net-next and re-run all the benchmarks. --- Christoph Lameter (2): slub: create new ___slab_alloc function that can be called with irqs disabled slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer (5): slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG slab: implement bulking for SLAB allocator slub: support for bulk free with SLUB freelists slub: optimize bulk slowpath free by detached freelist slub: do prefetching in kmem_cache_alloc_bulk() mm/slab.c | 87 ++++++++++++++----- mm/slub.c | 276 +++++++++++++++++++++++++++++++++++++++++++++---------------- 2 files changed, 267 insertions(+), 96 deletions(-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f54.google.com (mail-qg0-f54.google.com [209.85.192.54]) by kanga.kvack.org (Postfix) with ESMTP id 32FE96B0255 for ; Mon, 28 Sep 2015 08:26:13 -0400 (EDT) Received: by qgx61 with SMTP id 61so118345673qgx.3 for ; Mon, 28 Sep 2015 05:26:13 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id b21si15234300qkj.22.2015.09.28.05.26.12 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:12 -0700 (PDT) Subject: [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:09 +0200 Message-ID: <20150928122609.15409.30765.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim From: Christoph Lameter NOTICE: Accepted by AKPM http://ozlabs.org/~akpm/mmots/broken-out/slub-create-new-___slab_alloc-function-that-can-be-called-with-irqs-disabled.patch Bulk alloc needs a function like that because it enables interrupts before calling __slab_alloc which promptly disables them again using the expensive local_irq_save(). Signed-off-by: Christoph Lameter Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 44 +++++++++++++++++++++++++++++--------------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index f614b5dc396b..02cfb3a5983e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2298,23 +2298,15 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * And if we were unable to get a new slab from the partial slab lists then * we need to allocate a new slab. This is the slowest path since it involves * a call to the page allocator and the setup of a new slab. + * + * Version of __slab_alloc to use when we know that interrupts are + * already disabled (which is the case for bulk allocation). */ -static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, +static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c) { void *freelist; struct page *page; - unsigned long flags; - - local_irq_save(flags); -#ifdef CONFIG_PREEMPT - /* - * We may have been preempted and rescheduled on a different - * cpu before disabling interrupts. Need to reload cpu area - * pointer. - */ - c = this_cpu_ptr(s->cpu_slab); -#endif page = c->page; if (!page) @@ -2372,7 +2364,6 @@ load_freelist: VM_BUG_ON(!c->page->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); - local_irq_restore(flags); return freelist; new_slab: @@ -2389,7 +2380,6 @@ new_slab: if (unlikely(!freelist)) { slab_out_of_memory(s, gfpflags, node); - local_irq_restore(flags); return NULL; } @@ -2405,11 +2395,35 @@ new_slab: deactivate_slab(s, page, get_freepointer(s, freelist)); c->page = NULL; c->freelist = NULL; - local_irq_restore(flags); return freelist; } /* + * Another one that disabled interrupt and compensates for possible + * cpu changes by refetching the per cpu area pointer. + */ +static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, + unsigned long addr, struct kmem_cache_cpu *c) +{ + void *p; + unsigned long flags; + + local_irq_save(flags); +#ifdef CONFIG_PREEMPT + /* + * We may have been preempted and rescheduled on a different + * cpu before disabling interrupts. Need to reload cpu area + * pointer. + */ + c = this_cpu_ptr(s->cpu_slab); +#endif + + p = ___slab_alloc(s, gfpflags, node, addr, c); + local_irq_restore(flags); + return p; +} + +/* * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc) * have the fastpath folded into their functions. So no function call * overhead for requests that can be satisfied on the fastpath. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f53.google.com (mail-qg0-f53.google.com [209.85.192.53]) by kanga.kvack.org (Postfix) with ESMTP id 704776B0257 for ; Mon, 28 Sep 2015 08:26:17 -0400 (EDT) Received: by qgez77 with SMTP id z77so118867787qge.1 for ; Mon, 28 Sep 2015 05:26:17 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 141si15199463qhx.1.2015.09.28.05.26.16 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:16 -0700 (PDT) Subject: [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:14 +0200 Message-ID: <20150928122614.15409.89190.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim From: Christoph Lameter NOTICE: Accepted by AKPM http://ozlabs.org/~akpm/mmots/broken-out/slub-avoid-irqoff-on-in-bulk-allocation.patch Use the new function that can do allocation while interrupts are disabled. Avoids irq on/off sequences. Signed-off-by: Christoph Lameter Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 02cfb3a5983e..024eed32da2c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2821,30 +2821,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void *object = c->freelist; if (unlikely(!object)) { - local_irq_enable(); /* * Invoking slow path likely have side-effect * of re-populating per CPU c->freelist */ - p[i] = __slab_alloc(s, flags, NUMA_NO_NODE, + p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_, c); - if (unlikely(!p[i])) { - __kmem_cache_free_bulk(s, i, p); - return false; - } - local_irq_disable(); + if (unlikely(!p[i])) + goto error; + c = this_cpu_ptr(s->cpu_slab); continue; /* goto for-loop */ } /* kmem_cache debug support */ s = slab_pre_alloc_hook(s, flags); - if (unlikely(!s)) { - __kmem_cache_free_bulk(s, i, p); - c->tid = next_tid(c->tid); - local_irq_enable(); - return false; - } + if (unlikely(!s)) + goto error; c->freelist = get_freepointer(s, object); p[i] = object; @@ -2864,6 +2857,11 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, } return true; + +error: + __kmem_cache_free_bulk(s, i, p); + local_irq_enable(); + return false; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f176.google.com (mail-qk0-f176.google.com [209.85.220.176]) by kanga.kvack.org (Postfix) with ESMTP id 83D306B0259 for ; Mon, 28 Sep 2015 08:26:22 -0400 (EDT) Received: by qkas79 with SMTP id s79so1821150qka.0 for ; Mon, 28 Sep 2015 05:26:22 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id j188si15185212qhc.56.2015.09.28.05.26.21 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:21 -0700 (PDT) Subject: [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:19 +0200 Message-ID: <20150928122619.15409.68763.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated #else. For readability mark it with a comment. Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 024eed32da2c..1cf98d89546d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1202,7 +1202,7 @@ unsigned long kmem_cache_flags(unsigned long object_size, return flags; } -#else +#else /* !CONFIG_SLUB_DEBUG */ static inline void setup_object_debug(struct kmem_cache *s, struct page *page, void *object) {} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f173.google.com (mail-qk0-f173.google.com [209.85.220.173]) by kanga.kvack.org (Postfix) with ESMTP id A4C7E6B025A for ; Mon, 28 Sep 2015 08:26:27 -0400 (EDT) Received: by qkas79 with SMTP id s79so1822086qka.0 for ; Mon, 28 Sep 2015 05:26:27 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 84si15173699qhx.86.2015.09.28.05.26.26 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:26 -0700 (PDT) Subject: [PATCH 4/7] slab: implement bulking for SLAB allocator From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:24 +0200 Message-ID: <20150928122624.15409.23038.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim Implement a basic approach of bulking in the SLAB allocator. Simply use local_irq_{disable,enable} and call single alloc/free in a loop. This simple implementation approach is surprising fast. Notice the normal SLAB fastpath is: 96 cycles (24.119 ns). Below table show that single object bulking only takes 42 cycles. This can be explained by the bulk APIs requirement to be called from a known interrupt context, that is with interrupts enabled. This allow us to avoid the expensive (37 cycles) local_irq_{save,restore}, and instead use the much faster (7 cycles) local_irq_{disable,restore}. Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz: bulk - Current - simple SLAB bulk implementation 1 - 115 cycles(tsc) 28.812 ns - 42 cycles(tsc) 10.715 ns - improved 63.5% 2 - 103 cycles(tsc) 25.956 ns - 27 cycles(tsc) 6.985 ns - improved 73.8% 3 - 101 cycles(tsc) 25.336 ns - 22 cycles(tsc) 5.733 ns - improved 78.2% 4 - 100 cycles(tsc) 25.147 ns - 21 cycles(tsc) 5.319 ns - improved 79.0% 8 - 98 cycles(tsc) 24.616 ns - 18 cycles(tsc) 4.620 ns - improved 81.6% 16 - 97 cycles(tsc) 24.408 ns - 17 cycles(tsc) 4.344 ns - improved 82.5% 30 - 98 cycles(tsc) 24.641 ns - 16 cycles(tsc) 4.202 ns - improved 83.7% 32 - 98 cycles(tsc) 24.607 ns - 16 cycles(tsc) 4.199 ns - improved 83.7% 34 - 98 cycles(tsc) 24.605 ns - 18 cycles(tsc) 4.579 ns - improved 81.6% 48 - 97 cycles(tsc) 24.463 ns - 17 cycles(tsc) 4.405 ns - improved 82.5% 64 - 97 cycles(tsc) 24.370 ns - 17 cycles(tsc) 4.384 ns - improved 82.5% 128 - 99 cycles(tsc) 24.763 ns - 19 cycles(tsc) 4.755 ns - improved 80.8% 158 - 98 cycles(tsc) 24.708 ns - 18 cycles(tsc) 4.723 ns - improved 81.6% 250 - 101 cycles(tsc) 25.342 ns - 20 cycles(tsc) 5.035 ns - improved 80.2% Also notice how well bulking maintains the performance when the bulk size increases (which is a soar spot for the SLUB allocator). Increasing the bulk size further: 20 cycles(tsc) 5.214 ns (bulk: 512) 30 cycles(tsc) 7.734 ns (bulk: 768) 40 cycles(tsc) 10.244 ns (bulk:1024) 72 cycles(tsc) 18.049 ns (bulk:2048) 90 cycles(tsc) 22.585 ns (bulk:4096) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer --- mm/slab.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 62 insertions(+), 25 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index c77ebe6cc87c..21da6b1ccae3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3234,11 +3234,15 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags) #endif /* CONFIG_NUMA */ static __always_inline void * -slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) +slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller, + bool irq_off_needed) { unsigned long save_flags; void *objp; + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + flags &= gfp_allowed_mask; lockdep_trace_alloc(flags); @@ -3249,9 +3253,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) cachep = memcg_kmem_get_cache(cachep, flags); cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); + if (irq_off_needed) + local_irq_save(save_flags); objp = __do_cache_alloc(cachep, flags); - local_irq_restore(save_flags); + if (irq_off_needed) + local_irq_restore(save_flags); objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags, flags); @@ -3407,7 +3413,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp, */ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) { - void *ret = slab_alloc(cachep, flags, _RET_IP_); + void *ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmem_cache_alloc(_RET_IP_, ret, cachep->object_size, cachep->size, flags); @@ -3416,16 +3422,23 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) } EXPORT_SYMBOL(kmem_cache_alloc); -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - __kmem_cache_free_bulk(s, size, p); -} -EXPORT_SYMBOL(kmem_cache_free_bulk); - +/* Note that interrupts must be enabled when calling this function. */ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, - void **p) + void **p) { - return __kmem_cache_alloc_bulk(s, flags, size, p); + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) { + void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false); + + if (!x) { + __kmem_cache_free_bulk(s, i, p); + return false; + } + } + local_irq_enable(); + return true; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); @@ -3435,7 +3448,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size) { void *ret; - ret = slab_alloc(cachep, flags, _RET_IP_); + ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmalloc(_RET_IP_, ret, size, cachep->size, flags); @@ -3526,7 +3539,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags, cachep = kmalloc_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(cachep))) return cachep; - ret = slab_alloc(cachep, flags, caller); + ret = slab_alloc(cachep, flags, caller, true); trace_kmalloc(caller, ret, size, cachep->size, flags); @@ -3546,32 +3559,56 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller) } EXPORT_SYMBOL(__kmalloc_track_caller); -/** - * kmem_cache_free - Deallocate an object - * @cachep: The cache the allocation was from. - * @objp: The previously allocated object. - * - * Free an object which was previously allocated from this - * cache. - */ -void kmem_cache_free(struct kmem_cache *cachep, void *objp) +/* Caller is responsible for disabling local IRQs */ +static __always_inline void __kmem_cache_free(struct kmem_cache *cachep, + void *objp, bool irq_off_needed) { unsigned long flags; + + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + cachep = cache_from_obj(cachep, objp); if (!cachep) return; - local_irq_save(flags); + if (irq_off_needed) + local_irq_save(flags); debug_check_no_locks_freed(objp, cachep->object_size); if (!(cachep->flags & SLAB_DEBUG_OBJECTS)) debug_check_no_obj_freed(objp, cachep->object_size); __cache_free(cachep, objp, _RET_IP_); - local_irq_restore(flags); + if (irq_off_needed) + local_irq_restore(flags); +} +/** + * kmem_cache_free - Deallocate an object + * @cachep: The cache the allocation was from. + * @objp: The previously allocated object. + * + * Free an object which was previously allocated from this + * cache. + */ +void kmem_cache_free(struct kmem_cache *cachep, void *objp) +{ + __kmem_cache_free(cachep, objp, true); trace_kmem_cache_free(_RET_IP_, objp); } EXPORT_SYMBOL(kmem_cache_free); +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) + __kmem_cache_free(s, p[i], false); + local_irq_enable(); +} +EXPORT_SYMBOL(kmem_cache_free_bulk); + /** * kfree - free previously allocated memory * @objp: pointer returned by kmalloc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f176.google.com (mail-qk0-f176.google.com [209.85.220.176]) by kanga.kvack.org (Postfix) with ESMTP id B72596B025B for ; Mon, 28 Sep 2015 08:26:32 -0400 (EDT) Received: by qkcf65 with SMTP id f65so66602797qkc.3 for ; Mon, 28 Sep 2015 05:26:32 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id e11si15235957qgf.0.2015.09.28.05.26.31 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:32 -0700 (PDT) Subject: [PATCH 5/7] slub: support for bulk free with SLUB freelists From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:29 +0200 Message-ID: <20150928122629.15409.69466.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim Make it possible to free a freelist with several objects by extending __slab_free() and slab_free() with two arguments: a freelist_head pointer and objects counter (cnt). If freelist_head pointer is set, then the object is the freelist tail pointer. This allows a freelist with several objects (all within the same slub-page) to be free'ed using a single locked cmpxchg_double in __slab_free() and with an unlocked cmpxchg_double in slab_free(). Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slub-page. These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs. Micro benchmarking showed no performance reduction due to this change, when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- mm/slub.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 84 insertions(+), 13 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1cf98d89546d..13b5f53e4840 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) { u8 *p = object; + /* Freepointer not overwritten as SLAB_POISON moved it after object */ if (s->flags & __OBJECT_POISON) { memset(p, POISON_FREE, s->object_size - 1); p[s->object_size - 1] = POISON_END; } + /* + * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then + * freepointer is still safe, as then s->offset equals + * s->inuse and below redzone is after s->object_size and only + * area between s->object_size and s->inuse. + */ if (s->flags & SLAB_RED_ZONE) memset(p + s->object_size, val, s->inuse - s->object_size); } @@ -1063,18 +1070,32 @@ bad: return 0; } +/* Supports checking bulk free of a constructed freelist */ static noinline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *obj_tail, void *freelist_head, int bulk_cnt, unsigned long addr, unsigned long *flags) { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + void *object; + int cnt = 0; spin_lock_irqsave(&n->list_lock, *flags); slab_lock(page); + /* + * Bulk free of a constructed freelist is indicated by the + * freelist_head pointer being set, else obj_tail is object + * being free'ed + */ + object = freelist_head ? : obj_tail; + if (!check_slab(s, page)) goto fail; +next_object: + cnt++; + if (!check_valid_pointer(s, page, object)) { slab_err(s, page, "Invalid object pointer 0x%p", object); goto fail; @@ -1105,8 +1126,19 @@ static noinline struct kmem_cache_node *free_debug_processing( if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); + /* Freepointer not overwritten by init_object(), SLAB_POISON moved it */ init_object(s, object, SLUB_RED_INACTIVE); + + /* Reached end of constructed freelist yet? */ + if (object != obj_tail) { + object = get_freepointer(s, object); + goto next_object; + } out: + if (cnt != bulk_cnt) + slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", + bulk_cnt, cnt); + slab_unlock(page); /* * Keep node_lock to preserve integrity @@ -1210,7 +1242,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s, struct page *page, void *object, unsigned long addr) { return 0; } static inline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *obj_tail, void *freelist_head, int bulk_cnt, unsigned long addr, unsigned long *flags) { return NULL; } static inline int slab_pad_check(struct kmem_cache *s, struct page *page) @@ -1306,6 +1339,35 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x) kasan_slab_free(s, x); } +/* Compiler cannot detect that slab_free_freelist_hook() can be + * removed if slab_free_hook() evaluates to nothing. Thus, we need to + * catch all relevant config debug options here. + */ +#if defined(CONFIG_KMEMCHECK) || \ + defined(CONFIG_LOCKDEP) || \ + defined(CONFIG_DEBUG_KMEMLEAK) || \ + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ + defined(CONFIG_KASAN) +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, + void *freelist_head) +{ + /* + * Bulk free of a constructed freelist is indicated by the + * freelist_head pointer being set, else obj_tail is object + * being free'ed + */ + void *object = freelist_head ? : obj_tail; + + do { + slab_free_hook(s, object); + } while ((object != obj_tail) && + (object = get_freepointer(s, object))); +} +#else +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, + void *freelist_head) {} +#endif + static void setup_object(struct kmem_cache *s, struct page *page, void *object) { @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * So we still attempt to reduce cache line usage. Just take the slab * lock and free the item. If there is no additional partial page * handling required then we can return immediately. + * + * Bulk free of a freelist with several objects (all pointing to the + * same page) possible by specifying freelist_head ptr and object as + * tail ptr, plus objects count (cnt). */ static void __slab_free(struct kmem_cache *s, struct page *page, - void *x, unsigned long addr) + void *x, unsigned long addr, + void *freelist_head, int cnt) { void *prior; void **object = (void *)x; @@ -2595,11 +2662,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page, unsigned long counters; struct kmem_cache_node *n = NULL; unsigned long uninitialized_var(flags); + void *new_freelist = freelist_head ? : x; stat(s, FREE_SLOWPATH); if (kmem_cache_debug(s) && - !(n = free_debug_processing(s, page, x, addr, &flags))) + !(n = free_debug_processing(s, page, x, freelist_head, cnt, + addr, &flags))) return; do { @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, set_freepointer(s, object, prior); new.counters = counters; was_frozen = new.frozen; - new.inuse--; + new.inuse -= cnt; if ((!new.inuse || !prior) && !was_frozen) { if (kmem_cache_has_cpu_partial(s) && !prior) { @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } while (!cmpxchg_double_slab(s, page, prior, counters, - object, new.counters, + new_freelist, new.counters, "__slab_free")); if (likely(!n)) { @@ -2710,13 +2779,15 @@ slab_empty: * with all sorts of special processing. */ static __always_inline void slab_free(struct kmem_cache *s, - struct page *page, void *x, unsigned long addr) + struct page *page, void *x, unsigned long addr, + void *freelist_head, int cnt) { void **object = (void *)x; + void *new_freelist = freelist_head ? : x; struct kmem_cache_cpu *c; unsigned long tid; - slab_free_hook(s, x); + slab_free_freelist_hook(s, x, freelist_head); redo: /* @@ -2740,14 +2811,14 @@ redo: if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, - object, next_tid(tid)))) { + new_freelist, next_tid(tid)))) { note_cmpxchg_failure("slab_free", s, tid); goto redo; } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, x, addr); + __slab_free(s, page, x, addr, freelist_head, cnt); } @@ -2756,7 +2827,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, _RET_IP_); + slab_free(s, virt_to_head_page(x), x, _RET_IP_, NULL, 1); trace_kmem_cache_free(_RET_IP_, x); } EXPORT_SYMBOL(kmem_cache_free); @@ -2791,7 +2862,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) c->tid = next_tid(c->tid); local_irq_enable(); /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_); + __slab_free(s, page, object, _RET_IP_, NULL, 1); local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); } @@ -3531,7 +3602,7 @@ void kfree(const void *x) __free_kmem_pages(page, compound_order(page)); return; } - slab_free(page->slab_cache, page, object, _RET_IP_); + slab_free(page->slab_cache, page, object, _RET_IP_, NULL, 1); } EXPORT_SYMBOL(kfree); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f177.google.com (mail-qk0-f177.google.com [209.85.220.177]) by kanga.kvack.org (Postfix) with ESMTP id B62E06B025C for ; Mon, 28 Sep 2015 08:26:37 -0400 (EDT) Received: by qkcf65 with SMTP id f65so66603931qkc.3 for ; Mon, 28 Sep 2015 05:26:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 7si15203157qgo.95.2015.09.28.05.26.36 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:37 -0700 (PDT) Subject: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:34 +0200 Message-ID: <20150928122634.15409.6956.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim This change focus on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk. The calls slab_free (fastpath) and __slab_free (slowpath) have been extended with support for bulk free, which amortize the overhead of the (locked) cmpxchg_double. To use the new bulking feature, we build what I call a detached freelist. The detached freelist takes advantage of three properties: 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free. 2) many freelist's can co-exist side-by-side in the same slab-page each with a separate head pointer. 3) it is the visibility of the head pointer that needs synchronization. Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects, without any synchronization needed. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs. All objects in a SLUB freelist must belong to the same slab-page. Thus, constructing the detached freelist is about matching objects that belong to the same slab-page. The bulk free array is scanned is a progressive manor with a limited look-ahead facility. Kmem debug support is handled in call of slab_free(). Notice kmem_cache_free_bulk no longer need to disable IRQs. This only slowed down single free bulk with approx 3 cycles. Performance data: Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns To get stable and comparable numbers, the kernel have been booted with "slab_merge" (this also improve performance for larger bulk sizes). Performance data, compared against fallback bulking: bulk - fallback bulk - improvement with this patch 1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0% 2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5% 3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6% 4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5% 8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0% 16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3% 30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3% 32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0% 34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0% 48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7% 64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0% 128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0% 158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7% 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4% Performance data, compared current in-kernel bulking: bulk - curr in-kernel - improvement with this patch 1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5% 2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1% 3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5% 4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1% 8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9% 16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6% 30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5% 48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0% 64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2% 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9% 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0% 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0% Performance with normal SLUB merging is significantly slower for larger bulking. This is believed to (primarily) be an effect of not having to share the per-CPU data-structures, as tuning per-CPU size can achieve similar performance. bulk - slab_nomerge - normal SLUB merge 1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0 2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0 3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0 4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0 8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0 16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0 30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5 32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4 34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1 48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1 64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19 Joint work with Alexander Duyck. [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- mm/slub.c | 109 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 79 insertions(+), 30 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 13b5f53e4840..c25717ab3b5a 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2832,44 +2832,93 @@ void kmem_cache_free(struct kmem_cache *s, void *x) } EXPORT_SYMBOL(kmem_cache_free); -/* Note that interrupts must be enabled when calling this function. */ -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - struct kmem_cache_cpu *c; +struct detached_freelist { struct page *page; - int i; + void *tail_object; + void *freelist; + int cnt; +}; - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); +/* + * This function progressively scans the array with free objects (with + * a limited look ahead) and extract objects belonging to the same + * page. It builds a detached freelist directly within the given + * page/objects. This can happen without any need for + * synchronization, because the objects are owned by running process. + * The freelist is build up as a single linked list in the objects. + * The idea is, that this detached freelist can then be bulk + * transferred to the real freelist(s), but only requiring a single + * synchronization primitive. Look ahead in the array is limited due + * to performance reasons. + */ +static int build_detached_freelist(struct kmem_cache *s, size_t size, + void **p, struct detached_freelist *df) +{ + size_t first_skipped_index = 0; + int lookahead = 3; + void *object; - for (i = 0; i < size; i++) { - void *object = p[i]; + /* Always re-init detached_freelist */ + df->page = NULL; - BUG_ON(!object); - /* kmem cache debug support */ - s = cache_from_obj(s, object); - if (unlikely(!s)) - goto exit; - slab_free_hook(s, object); + do { + object = p[--size]; + } while (!object && size); - page = virt_to_head_page(object); + if (!object) + return 0; - if (c->page == page) { - /* Fastpath: local CPU free */ - set_freepointer(s, object, c->freelist); - c->freelist = object; - } else { - c->tid = next_tid(c->tid); - local_irq_enable(); - /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_, NULL, 1); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Start new detached freelist */ + set_freepointer(s, object, NULL); + df->page = virt_to_head_page(object); + df->tail_object = object; + df->freelist = object; + p[size] = NULL; /* mark object processed */ + df->cnt = 1; + + while (size) { + object = p[--size]; + if (!object) + continue; /* Skip processed objects */ + + /* df->page is always set at this point */ + if (df->page == virt_to_head_page(object)) { + /* Opportunity build freelist */ + set_freepointer(s, object, df->freelist); + df->freelist = object; + df->cnt++; + p[size] = NULL; /* mark object processed */ + + continue; } + + /* Limit look ahead search */ + if (!--lookahead) + break; + + if (!first_skipped_index) + first_skipped_index = size + 1; } -exit: - c->tid = next_tid(c->tid); - local_irq_enable(); + + return first_skipped_index; +} + + +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + BUG_ON(!size); + + do { + struct detached_freelist df; + + size = build_detached_freelist(s, size, p, &df); + if (unlikely(!df.page)) + continue; + + slab_free(s, df.page, df.tail_object, _RET_IP_, + df.freelist, df.cnt); + } while (likely(size)); } EXPORT_SYMBOL(kmem_cache_free_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f45.google.com (mail-qg0-f45.google.com [209.85.192.45]) by kanga.kvack.org (Postfix) with ESMTP id 8493F6B025E for ; Mon, 28 Sep 2015 08:26:42 -0400 (EDT) Received: by qgx61 with SMTP id 61so118356461qgx.3 for ; Mon, 28 Sep 2015 05:26:42 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id z35si15224511qgz.33.2015.09.28.05.26.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 05:26:42 -0700 (PDT) Subject: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() From: Jesper Dangaard Brouer Date: Mon, 28 Sep 2015 14:26:39 +0200 Message-ID: <20150928122639.15409.21583.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim For practical use-cases it is beneficial to prefetch the next freelist object in bulk allocation loop. Micro benchmarking show approx 1 cycle change: bulk - prev-patch - this patch 1 - 49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0 2 - 30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1 3 - 23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2 4 - 20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2 8 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 16 - 17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1 30 - 18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1 32 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 34 - 23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1 48 - 21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1 64 - 20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1 128 - 27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0 158 - 30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0 250 - 37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0 Note, benchmark done with slab_nomerge to keep it stable enough for accurate comparison. Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index c25717ab3b5a..5af75a618b91 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, goto error; c = this_cpu_ptr(s->cpu_slab); + prefetch_freepointer(s, c->freelist); continue; /* goto for-loop */ } @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, goto error; c->freelist = get_freepointer(s, object); + prefetch_freepointer(s, c->freelist); p[i] = object; /* kmem_cache debug support */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f178.google.com (mail-io0-f178.google.com [209.85.223.178]) by kanga.kvack.org (Postfix) with ESMTP id 141B86B0038 for ; Mon, 28 Sep 2015 09:49:17 -0400 (EDT) Received: by ioii196 with SMTP id i196so175996631ioi.3 for ; Mon, 28 Sep 2015 06:49:16 -0700 (PDT) Received: from resqmta-ch2-04v.sys.comcast.net (resqmta-ch2-04v.sys.comcast.net. [2001:558:fe21:29:69:252:207:36]) by mx.google.com with ESMTPS id p19si11653319igs.100.2015.09.28.06.49.16 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 06:49:16 -0700 (PDT) Date: Mon, 28 Sep 2015 08:49:15 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG In-Reply-To: <20150928122619.15409.68763.stgit@canyon> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122619.15409.68763.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > The #ifdef of CONFIG_SLUB_DEBUG is located very far from > the associated #else. For readability mark it with a comment. Acked-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id C71316B0254 for ; Mon, 28 Sep 2015 10:53:18 -0400 (EDT) Received: by pablk4 with SMTP id lk4so79753751pab.3 for ; Mon, 28 Sep 2015 07:53:18 -0700 (PDT) Received: from mail-pa0-x22d.google.com (mail-pa0-x22d.google.com. [2607:f8b0:400e:c03::22d]) by mx.google.com with ESMTPS id ud7si29039781pab.185.2015.09.28.07.53.17 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 07:53:18 -0700 (PDT) Received: by pacfv12 with SMTP id fv12so180426435pac.2 for ; Mon, 28 Sep 2015 07:53:17 -0700 (PDT) Subject: Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() References: <20150928122444.15409.10498.stgit@canyon> <20150928122639.15409.21583.stgit@canyon> From: Alexander Duyck Message-ID: <5609545C.4010807@gmail.com> Date: Mon, 28 Sep 2015 07:53:16 -0700 MIME-Version: 1.0 In-Reply-To: <20150928122639.15409.21583.stgit@canyon> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer , linux-mm@kvack.org, Andrew Morton Cc: netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote: > For practical use-cases it is beneficial to prefetch the next freelist > object in bulk allocation loop. > > Micro benchmarking show approx 1 cycle change: > > bulk - prev-patch - this patch > 1 - 49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0 > 2 - 30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1 > 3 - 23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2 > 4 - 20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2 > 8 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > 16 - 17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1 > 30 - 18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1 > 32 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > 34 - 23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1 > 48 - 21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1 > 64 - 20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1 > 128 - 27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0 > 158 - 30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0 > 250 - 37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0 > > Note, benchmark done with slab_nomerge to keep it stable enough > for accurate comparison. > > Signed-off-by: Jesper Dangaard Brouer > --- > mm/slub.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/slub.c b/mm/slub.c > index c25717ab3b5a..5af75a618b91 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > goto error; > > c = this_cpu_ptr(s->cpu_slab); > + prefetch_freepointer(s, c->freelist); > continue; /* goto for-loop */ > } > > @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > goto error; > > c->freelist = get_freepointer(s, object); > + prefetch_freepointer(s, c->freelist); > p[i] = object; > > /* kmem_cache debug support */ > I can see the prefetch in the last item case being possibly useful since you have time between when you call the prefetch and when you are accessing the next object. However, is there any actual benefit to prefetching inside the loop itself? Based on your data above it doesn't seem like that is the case since you are now adding one additional cycle to the allocation and I am not seeing any actual gain reported here. - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f176.google.com (mail-io0-f176.google.com [209.85.223.176]) by kanga.kvack.org (Postfix) with ESMTP id 41F1D6B0254 for ; Mon, 28 Sep 2015 11:11:11 -0400 (EDT) Received: by ioiz6 with SMTP id z6so177822201ioi.2 for ; Mon, 28 Sep 2015 08:11:11 -0700 (PDT) Received: from resqmta-ch2-05v.sys.comcast.net (resqmta-ch2-05v.sys.comcast.net. [2001:558:fe21:29:69:252:207:37]) by mx.google.com with ESMTPS id c5si12074458igm.103.2015.09.28.08.11.10 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 08:11:10 -0700 (PDT) Date: Mon, 28 Sep 2015 10:11:09 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 4/7] slab: implement bulking for SLAB allocator In-Reply-To: <20150928122624.15409.23038.stgit@canyon> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122624.15409.23038.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > +/* Note that interrupts must be enabled when calling this function. */ > bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > - void **p) > + void **p) > { > - return __kmem_cache_alloc_bulk(s, flags, size, p); > + size_t i; > + > + local_irq_disable(); > + for (i = 0; i < size; i++) { > + void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false); > + > + if (!x) { > + __kmem_cache_free_bulk(s, i, p); > + return false; > + } > + } > + local_irq_enable(); > + return true; > } > EXPORT_SYMBOL(kmem_cache_alloc_bulk); > Ok the above could result in excessive times when the interrupts are kept off. Lets say someone is freeing 1000 objects? > +/* Note that interrupts must be enabled when calling this function. */ > +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > +{ > + size_t i; > + > + local_irq_disable(); > + for (i = 0; i < size; i++) > + __kmem_cache_free(s, p[i], false); > + local_irq_enable(); > +} > +EXPORT_SYMBOL(kmem_cache_free_bulk); Same concern here. We may just have to accept this for now. Acked-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f173.google.com (mail-ig0-f173.google.com [209.85.213.173]) by kanga.kvack.org (Postfix) with ESMTP id 3ABB16B0258 for ; Mon, 28 Sep 2015 11:16:52 -0400 (EDT) Received: by igbni9 with SMTP id ni9so52307638igb.0 for ; Mon, 28 Sep 2015 08:16:52 -0700 (PDT) Received: from resqmta-po-01v.sys.comcast.net (resqmta-po-01v.sys.comcast.net. [2001:558:fe16:19:96:114:154:160]) by mx.google.com with ESMTPS id 35si12613014ioq.87.2015.09.28.08.16.51 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 08:16:51 -0700 (PDT) Date: Mon, 28 Sep 2015 10:16:49 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists In-Reply-To: <20150928122629.15409.69466.stgit@canyon> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > diff --git a/mm/slub.c b/mm/slub.c > index 1cf98d89546d..13b5f53e4840 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) > { > u8 *p = object; > > + /* Freepointer not overwritten as SLAB_POISON moved it after object */ > if (s->flags & __OBJECT_POISON) { > memset(p, POISON_FREE, s->object_size - 1); > p[s->object_size - 1] = POISON_END; > } > > + /* > + * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then > + * freepointer is still safe, as then s->offset equals > + * s->inuse and below redzone is after s->object_size and only > + * area between s->object_size and s->inuse. > + */ > if (s->flags & SLAB_RED_ZONE) > memset(p + s->object_size, val, s->inuse - s->object_size); > } Are these comments really adding something? This is basic metadata handling for SLUB that is commented on elsehwere. > @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); > * So we still attempt to reduce cache line usage. Just take the slab > * lock and free the item. If there is no additional partial page > * handling required then we can return immediately. > + * > + * Bulk free of a freelist with several objects (all pointing to the > + * same page) possible by specifying freelist_head ptr and object as > + * tail ptr, plus objects count (cnt). > */ > static void __slab_free(struct kmem_cache *s, struct page *page, > - void *x, unsigned long addr) > + void *x, unsigned long addr, > + void *freelist_head, int cnt) Do you really need separate parameters for freelist_head? If you just want to deal with one object pass it as freelist_head and set cnt = 1? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f174.google.com (mail-io0-f174.google.com [209.85.223.174]) by kanga.kvack.org (Postfix) with ESMTP id 924666B0254 for ; Mon, 28 Sep 2015 11:22:53 -0400 (EDT) Received: by iofb144 with SMTP id b144so179455194iof.1 for ; Mon, 28 Sep 2015 08:22:53 -0700 (PDT) Received: from resqmta-ch2-09v.sys.comcast.net (resqmta-ch2-09v.sys.comcast.net. [2001:558:fe21:29:69:252:207:41]) by mx.google.com with ESMTPS id w7si10902261iod.51.2015.09.28.08.22.51 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 08:22:52 -0700 (PDT) Date: Mon, 28 Sep 2015 10:22:50 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist In-Reply-To: <20150928122634.15409.6956.stgit@canyon> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122634.15409.6956.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim Acked-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f46.google.com (mail-qg0-f46.google.com [209.85.192.46]) by kanga.kvack.org (Postfix) with ESMTP id ABCBD6B0256 for ; Mon, 28 Sep 2015 11:51:20 -0400 (EDT) Received: by qgt47 with SMTP id 47so124185594qgt.2 for ; Mon, 28 Sep 2015 08:51:20 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q191si16034146qha.128.2015.09.28.08.51.19 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 08:51:20 -0700 (PDT) Date: Mon, 28 Sep 2015 17:51:14 +0200 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Message-ID: <20150928175114.07e85114@redhat.com> In-Reply-To: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com On Mon, 28 Sep 2015 10:16:49 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > diff --git a/mm/slub.c b/mm/slub.c > > index 1cf98d89546d..13b5f53e4840 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) > > { > > u8 *p = object; > > > > + /* Freepointer not overwritten as SLAB_POISON moved it after object */ > > if (s->flags & __OBJECT_POISON) { > > memset(p, POISON_FREE, s->object_size - 1); > > p[s->object_size - 1] = POISON_END; > > } > > > > + /* > > + * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then > > + * freepointer is still safe, as then s->offset equals > > + * s->inuse and below redzone is after s->object_size and only > > + * area between s->object_size and s->inuse. > > + */ > > if (s->flags & SLAB_RED_ZONE) > > memset(p + s->object_size, val, s->inuse - s->object_size); > > } > > Are these comments really adding something? This is basic metadata > handling for SLUB that is commented on elsehwere. Not knowing SLUB as well as you, it took me several hours to realize init_object() didn't overwrite the freepointer in the object. Thus, I think these comments make the reader aware of not-so-obvious side-effects of SLAB_POISON and SLAB_RED_ZONE. > > @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); > > * So we still attempt to reduce cache line usage. Just take the slab > > * lock and free the item. If there is no additional partial page > > * handling required then we can return immediately. > > + * > > + * Bulk free of a freelist with several objects (all pointing to the > > + * same page) possible by specifying freelist_head ptr and object as > > + * tail ptr, plus objects count (cnt). > > */ > > static void __slab_free(struct kmem_cache *s, struct page *page, > > - void *x, unsigned long addr) > > + void *x, unsigned long addr, > > + void *freelist_head, int cnt) > > Do you really need separate parameters for freelist_head? If you just want > to deal with one object pass it as freelist_head and set cnt = 1? Yes, I need it. We need to know both the head and tail of the list to splice it. See: > @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, prior = page->freelist; counters = page->counters; > set_freepointer(s, object, prior); ^^^^^^ Here we update the tail ptr (object) to point to "prior" (page->freelist). > new.counters = counters; > was_frozen = new.frozen; > - new.inuse--; > + new.inuse -= cnt; > if ((!new.inuse || !prior) && !was_frozen) { > > if (kmem_cache_has_cpu_partial(s) && !prior) { > @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > > } while (!cmpxchg_double_slab(s, page, > prior, counters, > - object, new.counters, > + new_freelist, new.counters, > "__slab_free")); Here we update page->freelist ("prior") to point to the head. Thus, splicing the list. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f49.google.com (mail-qg0-f49.google.com [209.85.192.49]) by kanga.kvack.org (Postfix) with ESMTP id 41B2D6B0038 for ; Mon, 28 Sep 2015 11:59:08 -0400 (EDT) Received: by qgx61 with SMTP id 61so124523564qgx.3 for ; Mon, 28 Sep 2015 08:59:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id m9si16110381qkl.95.2015.09.28.08.59.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Sep 2015 08:59:07 -0700 (PDT) Date: Mon, 28 Sep 2015 17:59:01 +0200 From: Jesper Dangaard Brouer Subject: Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Message-ID: <20150928175901.39976cdb@redhat.com> In-Reply-To: <5609545C.4010807@gmail.com> References: <20150928122444.15409.10498.stgit@canyon> <20150928122639.15409.21583.stgit@canyon> <5609545C.4010807@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim , brouer@redhat.com On Mon, 28 Sep 2015 07:53:16 -0700 Alexander Duyck wrote: > On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote: > > For practical use-cases it is beneficial to prefetch the next freelist > > object in bulk allocation loop. > > > > Micro benchmarking show approx 1 cycle change: > > > > bulk - prev-patch - this patch > > 1 - 49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0 > > 2 - 30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1 > > 3 - 23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2 > > 4 - 20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2 > > 8 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > > 16 - 17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1 > > 30 - 18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1 > > 32 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > > 34 - 23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1 > > 48 - 21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1 > > 64 - 20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1 > > 128 - 27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0 > > 158 - 30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0 > > 250 - 37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0 > > > > Note, benchmark done with slab_nomerge to keep it stable enough > > for accurate comparison. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > mm/slub.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/mm/slub.c b/mm/slub.c > > index c25717ab3b5a..5af75a618b91 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > > goto error; > > > > c = this_cpu_ptr(s->cpu_slab); > > + prefetch_freepointer(s, c->freelist); > > continue; /* goto for-loop */ > > } > > > > @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > > goto error; > > > > c->freelist = get_freepointer(s, object); > > + prefetch_freepointer(s, c->freelist); > > p[i] = object; > > > > /* kmem_cache debug support */ > > > > I can see the prefetch in the last item case being possibly useful since > you have time between when you call the prefetch and when you are > accessing the next object. However, is there any actual benefit to > prefetching inside the loop itself? Based on your data above it doesn't > seem like that is the case since you are now adding one additional cycle > to the allocation and I am not seeing any actual gain reported here. The gain will first show up, when using bulk alloc in real use-cases. As you know, bulk alloc on RX path don't show any improvement. And I measured (with perf-mem-record) L1 miss'es here. I could reduce the L1 misses here by adding prefetch. But I cannot remember if I measured any PPS improvement with this. As you hint, the time I have between my prefetch and use is very small, thus the question is if this will show any benefit for real use-cases. We can drop this patch, and then I'll include it in my network use-case, and measure the effect? (Although I'll likely be wasting my time, as we should likely redesign the alloc API instead). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f181.google.com (mail-io0-f181.google.com [209.85.223.181]) by kanga.kvack.org (Postfix) with ESMTP id F302C6B0259 for ; Mon, 28 Sep 2015 12:28:16 -0400 (EDT) Received: by iofh134 with SMTP id h134so180984857iof.0 for ; Mon, 28 Sep 2015 09:28:16 -0700 (PDT) Received: from resqmta-ch2-05v.sys.comcast.net (resqmta-ch2-05v.sys.comcast.net. [2001:558:fe21:29:69:252:207:37]) by mx.google.com with ESMTPS id c32si12933937iod.141.2015.09.28.09.28.16 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 09:28:16 -0700 (PDT) Date: Mon, 28 Sep 2015 11:28:15 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists In-Reply-To: <20150928175114.07e85114@redhat.com> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > Do you really need separate parameters for freelist_head? If you just want > > to deal with one object pass it as freelist_head and set cnt = 1? > > Yes, I need it. We need to know both the head and tail of the list to > splice it. Ok so this is to avoid having to scan the list to its end? x is the end of the list and freelist_head the beginning. That is weird. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f176.google.com (mail-ig0-f176.google.com [209.85.213.176]) by kanga.kvack.org (Postfix) with ESMTP id 14D096B025B for ; Mon, 28 Sep 2015 12:30:02 -0400 (EDT) Received: by igbkq10 with SMTP id kq10so59737922igb.0 for ; Mon, 28 Sep 2015 09:30:01 -0700 (PDT) Received: from resqmta-ch2-11v.sys.comcast.net (resqmta-ch2-11v.sys.comcast.net. [2001:558:fe21:29:69:252:207:43]) by mx.google.com with ESMTPS id gb2si12411492igd.93.2015.09.28.09.30.01 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Sep 2015 09:30:01 -0700 (PDT) Date: Mon, 28 Sep 2015 11:30:00 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists In-Reply-To: <20150928175114.07e85114@redhat.com> Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > Not knowing SLUB as well as you, it took me several hours to realize > init_object() didn't overwrite the freepointer in the object. Thus, I > think these comments make the reader aware of not-so-obvious > side-effects of SLAB_POISON and SLAB_RED_ZONE. >>From the source: /* * Object layout: * * object address * Bytes of the object to be managed. * If the freepointer may overlay the object then the free * pointer is the first word of the object. * * Poisoning uses 0x6b (POISON_FREE) and the last byte is * 0xa5 (POISON_END) * * object + s->object_size * Padding to reach word boundary. This is also used for Redzoning. * Padding is extended by another word if Redzoning is enabled and * object_size == inuse. * * We fill with 0xbb (RED_INACTIVE) for inactive objects and with * 0xcc (RED_ACTIVE) for objects in use. * * object + s->inuse * Meta data starts here. * * A. Free pointer (if we cannot overwrite object on free) * B. Tracking data for SLAB_STORE_USER * C. Padding to reach required alignment boundary or at mininum * one word if debugging is on to be able to detect writes * before the word boundary. * * Padding is done using 0x5a (POISON_INUSE) * * object + s->size * Nothing is used beyond s->size. * * If slabcaches are merged then the object_size and inuse boundaries are mostly * ignored. And therefore no slab options that rely on these boundaries * may be used with merged slabcaches. */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f41.google.com (mail-qg0-f41.google.com [209.85.192.41]) by kanga.kvack.org (Postfix) with ESMTP id 8F01A6B0038 for ; Tue, 29 Sep 2015 03:12:26 -0400 (EDT) Received: by qgev79 with SMTP id v79so140974838qge.0 for ; Tue, 29 Sep 2015 00:12:26 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 9si19740840qhx.90.2015.09.29.00.12.25 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 00:12:25 -0700 (PDT) Date: Tue, 29 Sep 2015 09:12:19 +0200 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Message-ID: <20150929091219.40d1e217@redhat.com> In-Reply-To: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com On Mon, 28 Sep 2015 11:30:00 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > Not knowing SLUB as well as you, it took me several hours to realize > > init_object() didn't overwrite the freepointer in the object. Thus, I > > think these comments make the reader aware of not-so-obvious > > side-effects of SLAB_POISON and SLAB_RED_ZONE. > > From the source: > > /* > * Object layout: > * > * object address > * Bytes of the object to be managed. > * If the freepointer may overlay the object then the free > * pointer is the first word of the object. > * > * Poisoning uses 0x6b (POISON_FREE) and the last byte is > * 0xa5 (POISON_END) > * > * object + s->object_size > * Padding to reach word boundary. This is also used for Redzoning. > * Padding is extended by another word if Redzoning is enabled and > * object_size == inuse. > * > * We fill with 0xbb (RED_INACTIVE) for inactive objects and with > * 0xcc (RED_ACTIVE) for objects in use. > * > * object + s->inuse > * Meta data starts here. > * > * A. Free pointer (if we cannot overwrite object on free) > * B. Tracking data for SLAB_STORE_USER > * C. Padding to reach required alignment boundary or at mininum > * one word if debugging is on to be able to detect writes > * before the word boundary. Okay, I will remove the comment. The best doc on SLUB and SLAB layout comes from your slides titled: "Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB" Lets gracefully add a link to the slides here: http://events.linuxfoundation.org/sites/events/files/slides/slaballocators.pdf -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f181.google.com (mail-qk0-f181.google.com [209.85.220.181]) by kanga.kvack.org (Postfix) with ESMTP id 45FF06B0038 for ; Tue, 29 Sep 2015 03:32:31 -0400 (EDT) Received: by qkap81 with SMTP id p81so78380034qka.2 for ; Tue, 29 Sep 2015 00:32:31 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id f74si17363308qkf.43.2015.09.29.00.32.29 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 00:32:30 -0700 (PDT) Date: Tue, 29 Sep 2015 09:32:24 +0200 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Message-ID: <20150929093224.65bed249@redhat.com> In-Reply-To: References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com On Mon, 28 Sep 2015 11:28:15 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > > Do you really need separate parameters for freelist_head? If you just want > > > to deal with one object pass it as freelist_head and set cnt = 1? > > > > Yes, I need it. We need to know both the head and tail of the list to > > splice it. > > Ok so this is to avoid having to scan the list to its end? True. > x is the end > of the list and freelist_head the beginning. That is weird. Yes, it is a bit weird... the bulk free of freelists comes out as a second-class citizen. Okay, I'll try to change the slab_free() and __slab_free() calls to have a "head" and "tail". And let tail be NULL on single object free, to allow compiler to do constant propagation (thus keeping existing fastpath unaffected). (The same code should be generated, but we will have a more intuitive API). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f169.google.com (mail-qk0-f169.google.com [209.85.220.169]) by kanga.kvack.org (Postfix) with ESMTP id 7E8386B0257 for ; Tue, 29 Sep 2015 11:46:40 -0400 (EDT) Received: by qkcf65 with SMTP id f65so4858658qkc.3 for ; Tue, 29 Sep 2015 08:46:40 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id c2si15405631qkh.34.2015.09.29.08.46.39 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:46:39 -0700 (PDT) Subject: [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:46:57 +0200 Message-ID: <20150929154605.14465.98995.stgit@canyon> In-Reply-To: <20150928122444.15409.10498.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim Most important part of this patchset is the introducing of what I call detached freelist, for improving SLUB performance of object freeing in the "slowpath" of kmem_cache_free_bulk. Tagging patchset with "V4" to avoid confusion with "V2": (V2) http://thread.gmane.org/gmane.linux.kernel.mm/137469 Addressing comments from: ("V3") http://thread.gmane.org/gmane.linux.kernel.mm/139268 I've added Christoph Lameter's ACKs from prev review. * Only patch 5 is changed significantly and needs review. * Benchmarked, performance is the same Notes for patches: * First two patches (from Christoph) are already in AKPM MMOTS. * Patch 3 is trivial * Patch 4 is a repost, implements bulking for SLAB. - http://thread.gmane.org/gmane.linux.kernel.mm/138220 * Patch 5 and 6 are the important patches - Patch 5 handle "freelists" in slab_free() and __slab_free(). - Patch 6 intro detached freelists, and significant performance improvement Patches should be ready for the MM-tree, as I'm now handling kmem debug support. Based on top of commit 519f526d39 in net-next, but I've tested it applies on top of mmotm-2015-09-18-16-08. The benchmarking tools are avail here: https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm See: slab_bulk_test0{1,2,3}.c This was joint work with Alexander Duyck while still at Red Hat. This patchset is part of my network stack use-case. I'll post the network side of the patchset as soon as I've cleaned it up, rebased it on net-next and re-run all the benchmarks. --- Christoph Lameter (2): slub: create new ___slab_alloc function that can be called with irqs disabled slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer (4): slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG slab: implement bulking for SLAB allocator slub: support for bulk free with SLUB freelists slub: optimize bulk slowpath free by detached freelist mm/slab.c | 87 ++++++++++++++------ mm/slub.c | 263 +++++++++++++++++++++++++++++++++++++++++++------------------ 2 files changed, 249 insertions(+), 101 deletions(-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f47.google.com (mail-qg0-f47.google.com [209.85.192.47]) by kanga.kvack.org (Postfix) with ESMTP id EFA266B0258 for ; Tue, 29 Sep 2015 11:46:53 -0400 (EDT) Received: by qgx61 with SMTP id 61so9653757qgx.3 for ; Tue, 29 Sep 2015 08:46:53 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 71si21670313qhx.74.2015.09.29.08.46.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:46:53 -0700 (PDT) Subject: [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:47:12 +0200 Message-ID: <20150929154702.14465.68483.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim From: Christoph Lameter NOTICE: Accepted by AKPM http://ozlabs.org/~akpm/mmots/broken-out/slub-create-new-___slab_alloc-function-that-can-be-called-with-irqs-disabled.patch Bulk alloc needs a function like that because it enables interrupts before calling __slab_alloc which promptly disables them again using the expensive local_irq_save(). Signed-off-by: Christoph Lameter Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 44 +++++++++++++++++++++++++++++--------------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index f614b5dc396b..02cfb3a5983e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2298,23 +2298,15 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * And if we were unable to get a new slab from the partial slab lists then * we need to allocate a new slab. This is the slowest path since it involves * a call to the page allocator and the setup of a new slab. + * + * Version of __slab_alloc to use when we know that interrupts are + * already disabled (which is the case for bulk allocation). */ -static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, +static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c) { void *freelist; struct page *page; - unsigned long flags; - - local_irq_save(flags); -#ifdef CONFIG_PREEMPT - /* - * We may have been preempted and rescheduled on a different - * cpu before disabling interrupts. Need to reload cpu area - * pointer. - */ - c = this_cpu_ptr(s->cpu_slab); -#endif page = c->page; if (!page) @@ -2372,7 +2364,6 @@ load_freelist: VM_BUG_ON(!c->page->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); - local_irq_restore(flags); return freelist; new_slab: @@ -2389,7 +2380,6 @@ new_slab: if (unlikely(!freelist)) { slab_out_of_memory(s, gfpflags, node); - local_irq_restore(flags); return NULL; } @@ -2405,11 +2395,35 @@ new_slab: deactivate_slab(s, page, get_freepointer(s, freelist)); c->page = NULL; c->freelist = NULL; - local_irq_restore(flags); return freelist; } /* + * Another one that disabled interrupt and compensates for possible + * cpu changes by refetching the per cpu area pointer. + */ +static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, + unsigned long addr, struct kmem_cache_cpu *c) +{ + void *p; + unsigned long flags; + + local_irq_save(flags); +#ifdef CONFIG_PREEMPT + /* + * We may have been preempted and rescheduled on a different + * cpu before disabling interrupts. Need to reload cpu area + * pointer. + */ + c = this_cpu_ptr(s->cpu_slab); +#endif + + p = ___slab_alloc(s, gfpflags, node, addr, c); + local_irq_restore(flags); + return p; +} + +/* * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc) * have the fastpath folded into their functions. So no function call * overhead for requests that can be satisfied on the fastpath. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f169.google.com (mail-qk0-f169.google.com [209.85.220.169]) by kanga.kvack.org (Postfix) with ESMTP id 6BB356B0259 for ; Tue, 29 Sep 2015 11:47:01 -0400 (EDT) Received: by qkcf65 with SMTP id f65so4863824qkc.3 for ; Tue, 29 Sep 2015 08:47:01 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id o20si5223890qki.35.2015.09.29.08.46.59 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:47:00 -0700 (PDT) Subject: [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:47:18 +0200 Message-ID: <20150929154717.14465.67508.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim From: Christoph Lameter NOTICE: Accepted by AKPM http://ozlabs.org/~akpm/mmots/broken-out/slub-avoid-irqoff-on-in-bulk-allocation.patch Use the new function that can do allocation while interrupts are disabled. Avoids irq on/off sequences. Signed-off-by: Christoph Lameter Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 02cfb3a5983e..024eed32da2c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2821,30 +2821,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void *object = c->freelist; if (unlikely(!object)) { - local_irq_enable(); /* * Invoking slow path likely have side-effect * of re-populating per CPU c->freelist */ - p[i] = __slab_alloc(s, flags, NUMA_NO_NODE, + p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_, c); - if (unlikely(!p[i])) { - __kmem_cache_free_bulk(s, i, p); - return false; - } - local_irq_disable(); + if (unlikely(!p[i])) + goto error; + c = this_cpu_ptr(s->cpu_slab); continue; /* goto for-loop */ } /* kmem_cache debug support */ s = slab_pre_alloc_hook(s, flags); - if (unlikely(!s)) { - __kmem_cache_free_bulk(s, i, p); - c->tid = next_tid(c->tid); - local_irq_enable(); - return false; - } + if (unlikely(!s)) + goto error; c->freelist = get_freepointer(s, object); p[i] = object; @@ -2864,6 +2857,11 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, } return true; + +error: + __kmem_cache_free_bulk(s, i, p); + local_irq_enable(); + return false; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f42.google.com (mail-qg0-f42.google.com [209.85.192.42]) by kanga.kvack.org (Postfix) with ESMTP id AED236B025A for ; Tue, 29 Sep 2015 11:47:14 -0400 (EDT) Received: by qgez77 with SMTP id z77so9734272qge.1 for ; Tue, 29 Sep 2015 08:47:14 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 143si21692896qhy.11.2015.09.29.08.47.13 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:47:13 -0700 (PDT) Subject: [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:47:33 +0200 Message-ID: <20150929154723.14465.6260.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated #else. For readability mark it with a comment. Signed-off-by: Jesper Dangaard Brouer Acked-by: Christoph Lameter --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 024eed32da2c..1cf98d89546d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1202,7 +1202,7 @@ unsigned long kmem_cache_flags(unsigned long object_size, return flags; } -#else +#else /* !CONFIG_SLUB_DEBUG */ static inline void setup_object_debug(struct kmem_cache *s, struct page *page, void *object) {} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f45.google.com (mail-qg0-f45.google.com [209.85.192.45]) by kanga.kvack.org (Postfix) with ESMTP id 45EFF6B025E for ; Tue, 29 Sep 2015 11:47:44 -0400 (EDT) Received: by qgt47 with SMTP id 47so9692537qgt.2 for ; Tue, 29 Sep 2015 08:47:44 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id h60si21748209qgh.8.2015.09.29.08.47.43 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:47:43 -0700 (PDT) Subject: [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:48:02 +0200 Message-ID: <20150929154738.14465.13241.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim Implement a basic approach of bulking in the SLAB allocator. Simply use local_irq_{disable,enable} and call single alloc/free in a loop. This simple implementation approach is surprising fast. Notice the normal SLAB fastpath is: 96 cycles (24.119 ns). Below table show that single object bulking only takes 42 cycles. This can be explained by the bulk APIs requirement to be called from a known interrupt context, that is with interrupts enabled. This allow us to avoid the expensive (37 cycles) local_irq_{save,restore}, and instead use the much faster (7 cycles) local_irq_{disable,restore}. Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz: bulk - Current - simple SLAB bulk implementation 1 - 115 cycles(tsc) 28.812 ns - 42 cycles(tsc) 10.715 ns - improved 63.5% 2 - 103 cycles(tsc) 25.956 ns - 27 cycles(tsc) 6.985 ns - improved 73.8% 3 - 101 cycles(tsc) 25.336 ns - 22 cycles(tsc) 5.733 ns - improved 78.2% 4 - 100 cycles(tsc) 25.147 ns - 21 cycles(tsc) 5.319 ns - improved 79.0% 8 - 98 cycles(tsc) 24.616 ns - 18 cycles(tsc) 4.620 ns - improved 81.6% 16 - 97 cycles(tsc) 24.408 ns - 17 cycles(tsc) 4.344 ns - improved 82.5% 30 - 98 cycles(tsc) 24.641 ns - 16 cycles(tsc) 4.202 ns - improved 83.7% 32 - 98 cycles(tsc) 24.607 ns - 16 cycles(tsc) 4.199 ns - improved 83.7% 34 - 98 cycles(tsc) 24.605 ns - 18 cycles(tsc) 4.579 ns - improved 81.6% 48 - 97 cycles(tsc) 24.463 ns - 17 cycles(tsc) 4.405 ns - improved 82.5% 64 - 97 cycles(tsc) 24.370 ns - 17 cycles(tsc) 4.384 ns - improved 82.5% 128 - 99 cycles(tsc) 24.763 ns - 19 cycles(tsc) 4.755 ns - improved 80.8% 158 - 98 cycles(tsc) 24.708 ns - 18 cycles(tsc) 4.723 ns - improved 81.6% 250 - 101 cycles(tsc) 25.342 ns - 20 cycles(tsc) 5.035 ns - improved 80.2% Also notice how well bulking maintains the performance when the bulk size increases (which is a soar spot for the SLUB allocator). Increasing the bulk size further: 20 cycles(tsc) 5.214 ns (bulk: 512) 30 cycles(tsc) 7.734 ns (bulk: 768) 40 cycles(tsc) 10.244 ns (bulk:1024) 72 cycles(tsc) 18.049 ns (bulk:2048) 90 cycles(tsc) 22.585 ns (bulk:4096) It is not recommended to perform large bulking with SLAB, as local interrupts are disabled for the entire period. If these kind of use-cases evolve, this interface should be adjusted to mitigate/reduce the interrupts off period. [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer Acked-by: Christoph Lameter --- mm/slab.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 62 insertions(+), 25 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index c77ebe6cc87c..21da6b1ccae3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3234,11 +3234,15 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags) #endif /* CONFIG_NUMA */ static __always_inline void * -slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) +slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller, + bool irq_off_needed) { unsigned long save_flags; void *objp; + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + flags &= gfp_allowed_mask; lockdep_trace_alloc(flags); @@ -3249,9 +3253,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) cachep = memcg_kmem_get_cache(cachep, flags); cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); + if (irq_off_needed) + local_irq_save(save_flags); objp = __do_cache_alloc(cachep, flags); - local_irq_restore(save_flags); + if (irq_off_needed) + local_irq_restore(save_flags); objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags, flags); @@ -3407,7 +3413,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp, */ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) { - void *ret = slab_alloc(cachep, flags, _RET_IP_); + void *ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmem_cache_alloc(_RET_IP_, ret, cachep->object_size, cachep->size, flags); @@ -3416,16 +3422,23 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) } EXPORT_SYMBOL(kmem_cache_alloc); -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - __kmem_cache_free_bulk(s, size, p); -} -EXPORT_SYMBOL(kmem_cache_free_bulk); - +/* Note that interrupts must be enabled when calling this function. */ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, - void **p) + void **p) { - return __kmem_cache_alloc_bulk(s, flags, size, p); + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) { + void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false); + + if (!x) { + __kmem_cache_free_bulk(s, i, p); + return false; + } + } + local_irq_enable(); + return true; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); @@ -3435,7 +3448,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size) { void *ret; - ret = slab_alloc(cachep, flags, _RET_IP_); + ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmalloc(_RET_IP_, ret, size, cachep->size, flags); @@ -3526,7 +3539,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags, cachep = kmalloc_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(cachep))) return cachep; - ret = slab_alloc(cachep, flags, caller); + ret = slab_alloc(cachep, flags, caller, true); trace_kmalloc(caller, ret, size, cachep->size, flags); @@ -3546,32 +3559,56 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller) } EXPORT_SYMBOL(__kmalloc_track_caller); -/** - * kmem_cache_free - Deallocate an object - * @cachep: The cache the allocation was from. - * @objp: The previously allocated object. - * - * Free an object which was previously allocated from this - * cache. - */ -void kmem_cache_free(struct kmem_cache *cachep, void *objp) +/* Caller is responsible for disabling local IRQs */ +static __always_inline void __kmem_cache_free(struct kmem_cache *cachep, + void *objp, bool irq_off_needed) { unsigned long flags; + + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + cachep = cache_from_obj(cachep, objp); if (!cachep) return; - local_irq_save(flags); + if (irq_off_needed) + local_irq_save(flags); debug_check_no_locks_freed(objp, cachep->object_size); if (!(cachep->flags & SLAB_DEBUG_OBJECTS)) debug_check_no_obj_freed(objp, cachep->object_size); __cache_free(cachep, objp, _RET_IP_); - local_irq_restore(flags); + if (irq_off_needed) + local_irq_restore(flags); +} +/** + * kmem_cache_free - Deallocate an object + * @cachep: The cache the allocation was from. + * @objp: The previously allocated object. + * + * Free an object which was previously allocated from this + * cache. + */ +void kmem_cache_free(struct kmem_cache *cachep, void *objp) +{ + __kmem_cache_free(cachep, objp, true); trace_kmem_cache_free(_RET_IP_, objp); } EXPORT_SYMBOL(kmem_cache_free); +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) + __kmem_cache_free(s, p[i], false); + local_irq_enable(); +} +EXPORT_SYMBOL(kmem_cache_free_bulk); + /** * kfree - free previously allocated memory * @objp: pointer returned by kmalloc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f54.google.com (mail-qg0-f54.google.com [209.85.192.54]) by kanga.kvack.org (Postfix) with ESMTP id D1E8E6B0260 for ; Tue, 29 Sep 2015 11:47:58 -0400 (EDT) Received: by qgez77 with SMTP id z77so9754307qge.1 for ; Tue, 29 Sep 2015 08:47:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id k32si21737441qge.29.2015.09.29.08.47.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:47:58 -0700 (PDT) Subject: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:48:17 +0200 Message-ID: <20150929154807.14465.76422.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim Make it possible to free a freelist with several objects by adjusting API of slab_free() and __slab_free() to have head, tail and an objects counter (cnt). Tail being NULL indicate single object free of head object. This allow compiler inline constant propagation in slab_free() and slab_free_freelist_hook() to avoid adding any overhead in case of single object free. This allows a freelist with several objects (all within the same slab-page) to be free'ed using a single locked cmpxchg_double in __slab_free() and with an unlocked cmpxchg_double in slab_free(). Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slab-page. These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs. Micro benchmarking showed no performance reduction due to this change, when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- V4: - Change API per req of Christoph Lameter - Remove comments in init_object. mm/slub.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 69 insertions(+), 18 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1cf98d89546d..7c2abc33fd4e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1063,11 +1063,15 @@ bad: return 0; } +/* Supports checking bulk free of a constructed freelist */ static noinline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *head, void *tail, int bulk_cnt, unsigned long addr, unsigned long *flags) { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + void *object = head; + int cnt = 0; spin_lock_irqsave(&n->list_lock, *flags); slab_lock(page); @@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing( if (!check_slab(s, page)) goto fail; +next_object: + cnt++; + if (!check_valid_pointer(s, page, object)) { slab_err(s, page, "Invalid object pointer 0x%p", object); goto fail; @@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing( if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); + /* Freepointer not overwritten by init_object(), SLAB_POISON moved it */ init_object(s, object, SLUB_RED_INACTIVE); + + /* Reached end of constructed freelist yet? */ + if (object != tail) { + object = get_freepointer(s, object); + goto next_object; + } out: + if (cnt != bulk_cnt) + slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", + bulk_cnt, cnt); + slab_unlock(page); /* * Keep node_lock to preserve integrity @@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s, struct page *page, void *object, unsigned long addr) { return 0; } static inline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *head, void *tail, int bulk_cnt, unsigned long addr, unsigned long *flags) { return NULL; } static inline int slab_pad_check(struct kmem_cache *s, struct page *page) @@ -1306,6 +1325,31 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x) kasan_slab_free(s, x); } +/* Compiler cannot detect that slab_free_freelist_hook() can be + * removed if slab_free_hook() evaluates to nothing. Thus, we need to + * catch all relevant config debug options here. + */ +#if defined(CONFIG_KMEMCHECK) || \ + defined(CONFIG_LOCKDEP) || \ + defined(CONFIG_DEBUG_KMEMLEAK) || \ + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ + defined(CONFIG_KASAN) +static inline void slab_free_freelist_hook(struct kmem_cache *s, + void *head, void *tail) +{ + void *object = head; + void *tail_obj = tail ? : head; + + do { + slab_free_hook(s, object); + } while ((object != tail_obj) && + (object = get_freepointer(s, object))); +} +#else +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, + void *freelist_head) {} +#endif + static void setup_object(struct kmem_cache *s, struct page *page, void *object) { @@ -2586,10 +2630,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * handling required then we can return immediately. */ static void __slab_free(struct kmem_cache *s, struct page *page, - void *x, unsigned long addr) + void *head, void *tail, int cnt, + unsigned long addr) + { void *prior; - void **object = (void *)x; int was_frozen; struct page new; unsigned long counters; @@ -2599,7 +2644,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, stat(s, FREE_SLOWPATH); if (kmem_cache_debug(s) && - !(n = free_debug_processing(s, page, x, addr, &flags))) + !(n = free_debug_processing(s, page, head, tail, cnt, + addr, &flags))) return; do { @@ -2609,10 +2655,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } prior = page->freelist; counters = page->counters; - set_freepointer(s, object, prior); + set_freepointer(s, tail, prior); new.counters = counters; was_frozen = new.frozen; - new.inuse--; + new.inuse -= cnt; if ((!new.inuse || !prior) && !was_frozen) { if (kmem_cache_has_cpu_partial(s) && !prior) { @@ -2643,7 +2689,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } while (!cmpxchg_double_slab(s, page, prior, counters, - object, new.counters, + head, new.counters, "__slab_free")); if (likely(!n)) { @@ -2708,15 +2754,20 @@ slab_empty: * * If fastpath is not possible then fall back to __slab_free where we deal * with all sorts of special processing. + * + * Bulk free of a freelist with several objects (all pointing to the + * same page) possible by specifying head and tail ptr, plus objects + * count (cnt). Bulk free indicated by tail pointer being set. */ -static __always_inline void slab_free(struct kmem_cache *s, - struct page *page, void *x, unsigned long addr) +static __always_inline void slab_free(struct kmem_cache *s, struct page *page, + void *head, void *tail, int cnt, + unsigned long addr) { - void **object = (void *)x; + void *tail_obj = tail ? : head; struct kmem_cache_cpu *c; unsigned long tid; - slab_free_hook(s, x); + slab_free_freelist_hook(s, head, tail); redo: /* @@ -2735,19 +2786,19 @@ redo: barrier(); if (likely(page == c->page)) { - set_freepointer(s, object, c->freelist); + set_freepointer(s, tail_obj, c->freelist); if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, - object, next_tid(tid)))) { + head, next_tid(tid)))) { note_cmpxchg_failure("slab_free", s, tid); goto redo; } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, x, addr); + __slab_free(s, page, head, tail_obj, cnt, addr); } @@ -2756,7 +2807,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, _RET_IP_); + slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_); trace_kmem_cache_free(_RET_IP_, x); } EXPORT_SYMBOL(kmem_cache_free); @@ -2791,7 +2842,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) c->tid = next_tid(c->tid); local_irq_enable(); /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_); + __slab_free(s, page, object, object, 1, _RET_IP_); local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); } @@ -3531,7 +3582,7 @@ void kfree(const void *x) __free_kmem_pages(page, compound_order(page)); return; } - slab_free(page->slab_cache, page, object, _RET_IP_); + slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); } EXPORT_SYMBOL(kfree); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f47.google.com (mail-qg0-f47.google.com [209.85.192.47]) by kanga.kvack.org (Postfix) with ESMTP id 4C05F6B0261 for ; Tue, 29 Sep 2015 11:48:08 -0400 (EDT) Received: by qgev79 with SMTP id v79so9696551qge.0 for ; Tue, 29 Sep 2015 08:48:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id s18si21758870qks.8.2015.09.29.08.48.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 08:48:07 -0700 (PDT) Subject: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist From: Jesper Dangaard Brouer Date: Tue, 29 Sep 2015 17:48:26 +0200 Message-ID: <20150929154822.14465.50207.stgit@canyon> In-Reply-To: <20150929154605.14465.98995.stgit@canyon> References: <20150929154605.14465.98995.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim This change focus on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk. The calls slab_free (fastpath) and __slab_free (slowpath) have been extended with support for bulk free, which amortize the overhead of the (locked) cmpxchg_double. To use the new bulking feature, we build what I call a detached freelist. The detached freelist takes advantage of three properties: 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free. 2) many freelist's can co-exist side-by-side in the same slab-page each with a separate head pointer. 3) it is the visibility of the head pointer that needs synchronization. Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects, without any synchronization needed. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs. All objects in a SLUB freelist must belong to the same slab-page. Thus, constructing the detached freelist is about matching objects that belong to the same slab-page. The bulk free array is scanned is a progressive manor with a limited look-ahead facility. Kmem debug support is handled in call of slab_free(). Notice kmem_cache_free_bulk no longer need to disable IRQs. This only slowed down single free bulk with approx 3 cycles. Performance data: Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns To get stable and comparable numbers, the kernel have been booted with "slab_merge" (this also improve performance for larger bulk sizes). Performance data, compared against fallback bulking: bulk - fallback bulk - improvement with this patch 1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0% 2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5% 3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6% 4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5% 8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0% 16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3% 30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3% 32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0% 34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0% 48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7% 64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0% 128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0% 158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7% 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4% Performance data, compared current in-kernel bulking: bulk - curr in-kernel - improvement with this patch 1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5% 2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1% 3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5% 4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1% 8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9% 16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6% 30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5% 48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0% 64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2% 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9% 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0% 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0% Performance with normal SLUB merging is significantly slower for larger bulking. This is believed to (primarily) be an effect of not having to share the per-CPU data-structures, as tuning per-CPU size can achieve similar performance. bulk - slab_nomerge - normal SLUB merge 1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0 2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0 3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0 4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0 8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0 16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0 30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5 32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4 34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1 48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1 64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19 Joint work with Alexander Duyck. [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck Acked-by: Christoph Lameter --- mm/slub.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 78 insertions(+), 30 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 7c2abc33fd4e..53500f3b70ab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2812,44 +2812,92 @@ void kmem_cache_free(struct kmem_cache *s, void *x) } EXPORT_SYMBOL(kmem_cache_free); -/* Note that interrupts must be enabled when calling this function. */ -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - struct kmem_cache_cpu *c; +struct detached_freelist { struct page *page; - int i; + void *tail; + void *freelist; + int cnt; +}; - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); +/* + * This function progressively scans the array with free objects (with + * a limited look ahead) and extract objects belonging to the same + * page. It builds a detached freelist directly within the given + * page/objects. This can happen without any need for + * synchronization, because the objects are owned by running process. + * The freelist is build up as a single linked list in the objects. + * The idea is, that this detached freelist can then be bulk + * transferred to the real freelist(s), but only requiring a single + * synchronization primitive. Look ahead in the array is limited due + * to performance reasons. + */ +static int build_detached_freelist(struct kmem_cache *s, size_t size, + void **p, struct detached_freelist *df) +{ + size_t first_skipped_index = 0; + int lookahead = 3; + void *object; - for (i = 0; i < size; i++) { - void *object = p[i]; + /* Always re-init detached_freelist */ + df->page = NULL; - BUG_ON(!object); - /* kmem cache debug support */ - s = cache_from_obj(s, object); - if (unlikely(!s)) - goto exit; - slab_free_hook(s, object); + do { + object = p[--size]; + } while (!object && size); - page = virt_to_head_page(object); + if (!object) + return 0; - if (c->page == page) { - /* Fastpath: local CPU free */ - set_freepointer(s, object, c->freelist); - c->freelist = object; - } else { - c->tid = next_tid(c->tid); - local_irq_enable(); - /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, object, 1, _RET_IP_); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Start new detached freelist */ + set_freepointer(s, object, NULL); + df->page = virt_to_head_page(object); + df->tail = object; + df->freelist = object; + p[size] = NULL; /* mark object processed */ + df->cnt = 1; + + while (size) { + object = p[--size]; + if (!object) + continue; /* Skip processed objects */ + + /* df->page is always set at this point */ + if (df->page == virt_to_head_page(object)) { + /* Opportunity build freelist */ + set_freepointer(s, object, df->freelist); + df->freelist = object; + df->cnt++; + p[size] = NULL; /* mark object processed */ + + continue; } + + /* Limit look ahead search */ + if (!--lookahead) + break; + + if (!first_skipped_index) + first_skipped_index = size + 1; } -exit: - c->tid = next_tid(c->tid); - local_irq_enable(); + + return first_skipped_index; +} + + +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + BUG_ON(!size); + + do { + struct detached_freelist df; + + size = build_detached_freelist(s, size, p, &df); + if (unlikely(!df.page)) + continue; + + slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_); + } while (likely(size)); } EXPORT_SYMBOL(kmem_cache_free_bulk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f171.google.com (mail-io0-f171.google.com [209.85.223.171]) by kanga.kvack.org (Postfix) with ESMTP id 4917C6B025B for ; Tue, 29 Sep 2015 12:38:33 -0400 (EDT) Received: by iofb144 with SMTP id b144so17222760iof.1 for ; Tue, 29 Sep 2015 09:38:33 -0700 (PDT) Received: from mail-pa0-x232.google.com (mail-pa0-x232.google.com. [2607:f8b0:400e:c03::232]) by mx.google.com with ESMTPS id w7si16232239iod.51.2015.09.29.09.38.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 09:38:32 -0700 (PDT) Received: by pablk4 with SMTP id lk4so10517564pab.3 for ; Tue, 29 Sep 2015 09:38:32 -0700 (PDT) Subject: Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists References: <20150929154605.14465.98995.stgit@canyon> <20150929154807.14465.76422.stgit@canyon> From: Alexander Duyck Message-ID: <560ABE86.9050508@gmail.com> Date: Tue, 29 Sep 2015 09:38:30 -0700 MIME-Version: 1.0 In-Reply-To: <20150929154807.14465.76422.stgit@canyon> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer , linux-mm@kvack.org, Andrew Morton , Christoph Lameter Cc: netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Joonsoo Kim On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote: > Make it possible to free a freelist with several objects by adjusting > API of slab_free() and __slab_free() to have head, tail and an objects > counter (cnt). > > Tail being NULL indicate single object free of head object. This > allow compiler inline constant propagation in slab_free() and > slab_free_freelist_hook() to avoid adding any overhead in case of > single object free. > > This allows a freelist with several objects (all within the same > slab-page) to be free'ed using a single locked cmpxchg_double in > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > Object debugging on the free path is also extended to handle these > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > objects don't belong to the same slab-page. > > These changes are needed for the next patch to bulk free the detached > freelists it introduces and constructs. > > Micro benchmarking showed no performance reduction due to this change, > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Alexander Duyck > > --- > V4: > - Change API per req of Christoph Lameter > - Remove comments in init_object. > > mm/slub.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 69 insertions(+), 18 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 1cf98d89546d..7c2abc33fd4e 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1063,11 +1063,15 @@ bad: > return 0; > } > > +/* Supports checking bulk free of a constructed freelist */ > static noinline struct kmem_cache_node *free_debug_processing( > - struct kmem_cache *s, struct page *page, void *object, > + struct kmem_cache *s, struct page *page, > + void *head, void *tail, int bulk_cnt, > unsigned long addr, unsigned long *flags) > { > struct kmem_cache_node *n = get_node(s, page_to_nid(page)); > + void *object = head; > + int cnt = 0; > > spin_lock_irqsave(&n->list_lock, *flags); > slab_lock(page); > @@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing( > if (!check_slab(s, page)) > goto fail; > > +next_object: > + cnt++; > + > if (!check_valid_pointer(s, page, object)) { > slab_err(s, page, "Invalid object pointer 0x%p", object); > goto fail; > @@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing( > if (s->flags & SLAB_STORE_USER) > set_track(s, object, TRACK_FREE, addr); > trace(s, page, object, 0); > + /* Freepointer not overwritten by init_object(), SLAB_POISON moved it */ > init_object(s, object, SLUB_RED_INACTIVE); > + > + /* Reached end of constructed freelist yet? */ > + if (object != tail) { > + object = get_freepointer(s, object); > + goto next_object; > + } > out: > + if (cnt != bulk_cnt) > + slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", > + bulk_cnt, cnt); > + > slab_unlock(page); > /* > * Keep node_lock to preserve integrity > @@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s, > struct page *page, void *object, unsigned long addr) { return 0; } > > static inline struct kmem_cache_node *free_debug_processing( > - struct kmem_cache *s, struct page *page, void *object, > + struct kmem_cache *s, struct page *page, > + void *head, void *tail, int bulk_cnt, > unsigned long addr, unsigned long *flags) { return NULL; } > > static inline int slab_pad_check(struct kmem_cache *s, struct page *page) > @@ -1306,6 +1325,31 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x) > kasan_slab_free(s, x); > } > > +/* Compiler cannot detect that slab_free_freelist_hook() can be > + * removed if slab_free_hook() evaluates to nothing. Thus, we need to > + * catch all relevant config debug options here. > + */ Is it actually generating nothing but a pointer walking loop or is there a bit of code cruft that is being evaluated inside the loop? > +#if defined(CONFIG_KMEMCHECK) || \ > + defined(CONFIG_LOCKDEP) || \ > + defined(CONFIG_DEBUG_KMEMLEAK) || \ > + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ > + defined(CONFIG_KASAN) > +static inline void slab_free_freelist_hook(struct kmem_cache *s, > + void *head, void *tail) > +{ > + void *object = head; > + void *tail_obj = tail ? : head; > + > + do { > + slab_free_hook(s, object); > + } while ((object != tail_obj) && > + (object = get_freepointer(s, object))); > +} > +#else > +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, > + void *freelist_head) {} > +#endif > + Instead of messing around with an #else you might just wrap the contents of slab_free_freelist_hook in the #if/#endif instead of the entire function declaration. > static void setup_object(struct kmem_cache *s, struct page *page, > void *object) > { > @@ -2586,10 +2630,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); > * handling required then we can return immediately. > */ > static void __slab_free(struct kmem_cache *s, struct page *page, > - void *x, unsigned long addr) > + void *head, void *tail, int cnt, > + unsigned long addr) > + > { > void *prior; > - void **object = (void *)x; > int was_frozen; > struct page new; > unsigned long counters; > @@ -2599,7 +2644,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > stat(s, FREE_SLOWPATH); > > if (kmem_cache_debug(s) && > - !(n = free_debug_processing(s, page, x, addr, &flags))) > + !(n = free_debug_processing(s, page, head, tail, cnt, > + addr, &flags))) > return; > > do { > @@ -2609,10 +2655,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > } > prior = page->freelist; > counters = page->counters; > - set_freepointer(s, object, prior); > + set_freepointer(s, tail, prior); > new.counters = counters; > was_frozen = new.frozen; > - new.inuse--; > + new.inuse -= cnt; > if ((!new.inuse || !prior) && !was_frozen) { > > if (kmem_cache_has_cpu_partial(s) && !prior) { > @@ -2643,7 +2689,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > > } while (!cmpxchg_double_slab(s, page, > prior, counters, > - object, new.counters, > + head, new.counters, > "__slab_free")); > > if (likely(!n)) { > @@ -2708,15 +2754,20 @@ slab_empty: > * > * If fastpath is not possible then fall back to __slab_free where we deal > * with all sorts of special processing. > + * > + * Bulk free of a freelist with several objects (all pointing to the > + * same page) possible by specifying head and tail ptr, plus objects > + * count (cnt). Bulk free indicated by tail pointer being set. > */ > -static __always_inline void slab_free(struct kmem_cache *s, > - struct page *page, void *x, unsigned long addr) > +static __always_inline void slab_free(struct kmem_cache *s, struct page *page, > + void *head, void *tail, int cnt, > + unsigned long addr) > { > - void **object = (void *)x; > + void *tail_obj = tail ? : head; > struct kmem_cache_cpu *c; > unsigned long tid; > > - slab_free_hook(s, x); > + slab_free_freelist_hook(s, head, tail); > > redo: > /* > @@ -2735,19 +2786,19 @@ redo: > barrier(); > > if (likely(page == c->page)) { > - set_freepointer(s, object, c->freelist); > + set_freepointer(s, tail_obj, c->freelist); > > if (unlikely(!this_cpu_cmpxchg_double( > s->cpu_slab->freelist, s->cpu_slab->tid, > c->freelist, tid, > - object, next_tid(tid)))) { > + head, next_tid(tid)))) { > > note_cmpxchg_failure("slab_free", s, tid); > goto redo; > } > stat(s, FREE_FASTPATH); > } else > - __slab_free(s, page, x, addr); > + __slab_free(s, page, head, tail_obj, cnt, addr); > > } > > @@ -2756,7 +2807,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > s = cache_from_obj(s, x); > if (!s) > return; > - slab_free(s, virt_to_head_page(x), x, _RET_IP_); > + slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_); > trace_kmem_cache_free(_RET_IP_, x); > } > EXPORT_SYMBOL(kmem_cache_free); > @@ -2791,7 +2842,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > c->tid = next_tid(c->tid); > local_irq_enable(); > /* Slowpath: overhead locked cmpxchg_double_slab */ > - __slab_free(s, page, object, _RET_IP_); > + __slab_free(s, page, object, object, 1, _RET_IP_); > local_irq_disable(); > c = this_cpu_ptr(s->cpu_slab); > } > @@ -3531,7 +3582,7 @@ void kfree(const void *x) > __free_kmem_pages(page, compound_order(page)); > return; > } > - slab_free(page->slab_cache, page, object, _RET_IP_); > + slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); > } > EXPORT_SYMBOL(kfree); > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f47.google.com (mail-qg0-f47.google.com [209.85.192.47]) by kanga.kvack.org (Postfix) with ESMTP id CE2D66B025B for ; Tue, 29 Sep 2015 13:00:35 -0400 (EDT) Received: by qgt47 with SMTP id 47so11686433qgt.2 for ; Tue, 29 Sep 2015 10:00:35 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id o186si22027750qhb.25.2015.09.29.10.00.34 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 10:00:34 -0700 (PDT) Date: Tue, 29 Sep 2015 19:00:29 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20150929190029.01ca01f2@redhat.com> In-Reply-To: <560ABE86.9050508@gmail.com> References: <20150929154605.14465.98995.stgit@canyon> <20150929154807.14465.76422.stgit@canyon> <560ABE86.9050508@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com On Tue, 29 Sep 2015 09:38:30 -0700 Alexander Duyck wrote: > On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote: > > Make it possible to free a freelist with several objects by adjusting > > API of slab_free() and __slab_free() to have head, tail and an objects > > counter (cnt). > > > > Tail being NULL indicate single object free of head object. This > > allow compiler inline constant propagation in slab_free() and > > slab_free_freelist_hook() to avoid adding any overhead in case of > > single object free. > > > > This allows a freelist with several objects (all within the same > > slab-page) to be free'ed using a single locked cmpxchg_double in > > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > > > Object debugging on the free path is also extended to handle these > > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > > objects don't belong to the same slab-page. > > > > These changes are needed for the next patch to bulk free the detached > > freelists it introduces and constructs. > > > > Micro benchmarking showed no performance reduction due to this change, > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > > > > Signed-off-by: Jesper Dangaard Brouer > > Signed-off-by: Alexander Duyck > > > > --- > > V4: > > - Change API per req of Christoph Lameter > > - Remove comments in init_object. > > [...] > > > > +/* Compiler cannot detect that slab_free_freelist_hook() can be > > + * removed if slab_free_hook() evaluates to nothing. Thus, we need to > > + * catch all relevant config debug options here. > > + */ > > Is it actually generating nothing but a pointer walking loop or is there > a bit of code cruft that is being evaluated inside the loop? If any of the defines are activated, then slab_free_hook(s, object) will generate some code. In the case of single object free, then the compiler see that it can remove the loop, and also notice if slab_free_hook() eval to nothing. The compiler is not smart enough to remove the loop for multiobject case, even-though it can see that slab_free_hook() eval to nothing (in that case it does a pointer walk without any code eval). Thus, I need this construct. > > +#if defined(CONFIG_KMEMCHECK) || \ > > + defined(CONFIG_LOCKDEP) || \ > > + defined(CONFIG_DEBUG_KMEMLEAK) || \ > > + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ > > + defined(CONFIG_KASAN) > > +static inline void slab_free_freelist_hook(struct kmem_cache *s, > > + void *head, void *tail) > > +{ > > + void *object = head; > > + void *tail_obj = tail ? : head; > > + > > + do { > > + slab_free_hook(s, object); > > + } while ((object != tail_obj) && > > + (object = get_freepointer(s, object))); > > +} > > +#else > > +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, > > + void *freelist_head) {} > > +#endif > > + > > Instead of messing around with an #else you might just wrap the contents > of slab_free_freelist_hook in the #if/#endif instead of the entire > function declaration. I had it that way in an earlier version of the patch, but I liked better this way. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id 24CFF6B0255 for ; Tue, 29 Sep 2015 13:20:23 -0400 (EDT) Received: by pablk4 with SMTP id lk4so11407612pab.3 for ; Tue, 29 Sep 2015 10:20:22 -0700 (PDT) Received: from mail-pa0-x236.google.com (mail-pa0-x236.google.com. [2607:f8b0:400e:c03::236]) by mx.google.com with ESMTPS id ny9si38896585pbb.9.2015.09.29.10.20.22 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 10:20:22 -0700 (PDT) Received: by padhy16 with SMTP id hy16so11649351pad.1 for ; Tue, 29 Sep 2015 10:20:22 -0700 (PDT) Subject: Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists References: <20150929154605.14465.98995.stgit@canyon> <20150929154807.14465.76422.stgit@canyon> <560ABE86.9050508@gmail.com> <20150929190029.01ca01f2@redhat.com> From: Alexander Duyck Message-ID: <560AC854.6040601@gmail.com> Date: Tue, 29 Sep 2015 10:20:20 -0700 MIME-Version: 1.0 In-Reply-To: <20150929190029.01ca01f2@redhat.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Joonsoo Kim On 09/29/2015 10:00 AM, Jesper Dangaard Brouer wrote: > On Tue, 29 Sep 2015 09:38:30 -0700 > Alexander Duyck wrote: > >> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote: >>> +#if defined(CONFIG_KMEMCHECK) || \ >>> + defined(CONFIG_LOCKDEP) || \ >>> + defined(CONFIG_DEBUG_KMEMLEAK) || \ >>> + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ >>> + defined(CONFIG_KASAN) >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, >>> + void *head, void *tail) >>> +{ >>> + void *object = head; >>> + void *tail_obj = tail ? : head; >>> + >>> + do { >>> + slab_free_hook(s, object); >>> + } while ((object != tail_obj) && >>> + (object = get_freepointer(s, object))); >>> +} >>> +#else >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, >>> + void *freelist_head) {} >>> +#endif >>> + >> Instead of messing around with an #else you might just wrap the contents >> of slab_free_freelist_hook in the #if/#endif instead of the entire >> function declaration. > I had it that way in an earlier version of the patch, but I liked > better this way. It would be nice if the argument names were the same for both cases. Having the names differ will make it more difficult to maintain when changes need to be made to the function. - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f54.google.com (mail-qg0-f54.google.com [209.85.192.54]) by kanga.kvack.org (Postfix) with ESMTP id E927D6B0258 for ; Tue, 29 Sep 2015 14:16:11 -0400 (EDT) Received: by qgez77 with SMTP id z77so13829990qge.1 for ; Tue, 29 Sep 2015 11:16:11 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id n7si22374322qge.126.2015.09.29.11.16.11 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Sep 2015 11:16:11 -0700 (PDT) Date: Tue, 29 Sep 2015 20:16:05 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20150929201605.18626b1b@redhat.com> In-Reply-To: <560AC854.6040601@gmail.com> References: <20150929154605.14465.98995.stgit@canyon> <20150929154807.14465.76422.stgit@canyon> <560ABE86.9050508@gmail.com> <20150929190029.01ca01f2@redhat.com> <560AC854.6040601@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Duyck Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com On Tue, 29 Sep 2015 10:20:20 -0700 Alexander Duyck wrote: > On 09/29/2015 10:00 AM, Jesper Dangaard Brouer wrote: > > On Tue, 29 Sep 2015 09:38:30 -0700 > > Alexander Duyck wrote: > > > >> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote: > >>> +#if defined(CONFIG_KMEMCHECK) || \ > >>> + defined(CONFIG_LOCKDEP) || \ > >>> + defined(CONFIG_DEBUG_KMEMLEAK) || \ > >>> + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ > >>> + defined(CONFIG_KASAN) > >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, > >>> + void *head, void *tail) > >>> +{ > >>> + void *object = head; > >>> + void *tail_obj = tail ? : head; > >>> + > >>> + do { > >>> + slab_free_hook(s, object); > >>> + } while ((object != tail_obj) && > >>> + (object = get_freepointer(s, object))); > >>> +} > >>> +#else > >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, > >>> + void *freelist_head) {} > >>> +#endif > >>> + > >> Instead of messing around with an #else you might just wrap the contents > >> of slab_free_freelist_hook in the #if/#endif instead of the entire > >> function declaration. > > > > I had it that way in an earlier version of the patch, but I liked > > better this way. > > It would be nice if the argument names were the same for both cases. > Having the names differ will make it more difficult to maintain when > changes need to be made to the function. Nice spotted, I forgot to change arg names of the empty function, when I updated the patch. Guess, it is an argument for moving the "if defined()" into the function body. It just looked strange to have such a big ifdef block inside the function. I also earlier had it define another def and use that inside the function, but then the code-reader would not know if this new def was/could-be used later (nitpicking alert...) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f54.google.com (mail-qg0-f54.google.com [209.85.192.54]) by kanga.kvack.org (Postfix) with ESMTP id F2B056B0259 for ; Wed, 30 Sep 2015 07:44:00 -0400 (EDT) Received: by qgt47 with SMTP id 47so31826517qgt.2 for ; Wed, 30 Sep 2015 04:44:00 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id c2si166532qkh.34.2015.09.30.04.44.00 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 30 Sep 2015 04:44:00 -0700 (PDT) Subject: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists From: Jesper Dangaard Brouer Date: Wed, 30 Sep 2015 13:44:19 +0200 Message-ID: <20150930114255.13505.2618.stgit@canyon> In-Reply-To: <560ABE86.9050508@gmail.com> References: <560ABE86.9050508@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , Alexander Duyck Cc: Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Jesper Dangaard Brouer Make it possible to free a freelist with several objects by adjusting API of slab_free() and __slab_free() to have head, tail and an objects counter (cnt). Tail being NULL indicate single object free of head object. This allow compiler inline constant propagation in slab_free() and slab_free_freelist_hook() to avoid adding any overhead in case of single object free. This allows a freelist with several objects (all within the same slab-page) to be free'ed using a single locked cmpxchg_double in __slab_free() and with an unlocked cmpxchg_double in slab_free(). Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slab-page. These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs. Micro benchmarking showed no performance reduction due to this change, when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- V4: - Change API per req of Christoph Lameter - Remove comments in init_object. V4.1: - Took Alex'es approach on defines inside slab_free_freelist_hook() mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 67 insertions(+), 18 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1cf98d89546d..99fcfa8ed0c7 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1063,11 +1063,15 @@ bad: return 0; } +/* Supports checking bulk free of a constructed freelist */ static noinline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *head, void *tail, int bulk_cnt, unsigned long addr, unsigned long *flags) { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + void *object = head; + int cnt = 0; spin_lock_irqsave(&n->list_lock, *flags); slab_lock(page); @@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing( if (!check_slab(s, page)) goto fail; +next_object: + cnt++; + if (!check_valid_pointer(s, page, object)) { slab_err(s, page, "Invalid object pointer 0x%p", object); goto fail; @@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing( if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); + /* Freepointer not overwritten by init_object(), SLAB_POISON moved it */ init_object(s, object, SLUB_RED_INACTIVE); + + /* Reached end of constructed freelist yet? */ + if (object != tail) { + object = get_freepointer(s, object); + goto next_object; + } out: + if (cnt != bulk_cnt) + slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", + bulk_cnt, cnt); + slab_unlock(page); /* * Keep node_lock to preserve integrity @@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s, struct page *page, void *object, unsigned long addr) { return 0; } static inline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *head, void *tail, int bulk_cnt, unsigned long addr, unsigned long *flags) { return NULL; } static inline int slab_pad_check(struct kmem_cache *s, struct page *page) @@ -1306,6 +1325,29 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x) kasan_slab_free(s, x); } +static inline void slab_free_freelist_hook(struct kmem_cache *s, + void *head, void *tail) +{ +/* + * Compiler cannot detect this function can be removed if slab_free_hook() + * evaluates to nothing. Thus, catch all relevant config debug options here. + */ +#if defined(CONFIG_KMEMCHECK) || \ + defined(CONFIG_LOCKDEP) || \ + defined(CONFIG_DEBUG_KMEMLEAK) || \ + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ + defined(CONFIG_KASAN) + + void *object = head; + void *tail_obj = tail ? : head; + + do { + slab_free_hook(s, object); + } while ((object != tail_obj) && + (object = get_freepointer(s, object))); +#endif +} + static void setup_object(struct kmem_cache *s, struct page *page, void *object) { @@ -2586,10 +2628,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * handling required then we can return immediately. */ static void __slab_free(struct kmem_cache *s, struct page *page, - void *x, unsigned long addr) + void *head, void *tail, int cnt, + unsigned long addr) + { void *prior; - void **object = (void *)x; int was_frozen; struct page new; unsigned long counters; @@ -2599,7 +2642,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, stat(s, FREE_SLOWPATH); if (kmem_cache_debug(s) && - !(n = free_debug_processing(s, page, x, addr, &flags))) + !(n = free_debug_processing(s, page, head, tail, cnt, + addr, &flags))) return; do { @@ -2609,10 +2653,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } prior = page->freelist; counters = page->counters; - set_freepointer(s, object, prior); + set_freepointer(s, tail, prior); new.counters = counters; was_frozen = new.frozen; - new.inuse--; + new.inuse -= cnt; if ((!new.inuse || !prior) && !was_frozen) { if (kmem_cache_has_cpu_partial(s) && !prior) { @@ -2643,7 +2687,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } while (!cmpxchg_double_slab(s, page, prior, counters, - object, new.counters, + head, new.counters, "__slab_free")); if (likely(!n)) { @@ -2708,15 +2752,20 @@ slab_empty: * * If fastpath is not possible then fall back to __slab_free where we deal * with all sorts of special processing. + * + * Bulk free of a freelist with several objects (all pointing to the + * same page) possible by specifying head and tail ptr, plus objects + * count (cnt). Bulk free indicated by tail pointer being set. */ -static __always_inline void slab_free(struct kmem_cache *s, - struct page *page, void *x, unsigned long addr) +static __always_inline void slab_free(struct kmem_cache *s, struct page *page, + void *head, void *tail, int cnt, + unsigned long addr) { - void **object = (void *)x; + void *tail_obj = tail ? : head; struct kmem_cache_cpu *c; unsigned long tid; - slab_free_hook(s, x); + slab_free_freelist_hook(s, head, tail); redo: /* @@ -2735,19 +2784,19 @@ redo: barrier(); if (likely(page == c->page)) { - set_freepointer(s, object, c->freelist); + set_freepointer(s, tail_obj, c->freelist); if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, - object, next_tid(tid)))) { + head, next_tid(tid)))) { note_cmpxchg_failure("slab_free", s, tid); goto redo; } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, x, addr); + __slab_free(s, page, head, tail_obj, cnt, addr); } @@ -2756,7 +2805,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, _RET_IP_); + slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_); trace_kmem_cache_free(_RET_IP_, x); } EXPORT_SYMBOL(kmem_cache_free); @@ -2791,7 +2840,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) c->tid = next_tid(c->tid); local_irq_enable(); /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_); + __slab_free(s, page, object, object, 1, _RET_IP_); local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); } @@ -3531,7 +3580,7 @@ void kfree(const void *x) __free_kmem_pages(page, compound_order(page)); return; } - slab_free(page->slab_cache, page, object, _RET_IP_); + slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); } EXPORT_SYMBOL(kfree); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f171.google.com (mail-ig0-f171.google.com [209.85.213.171]) by kanga.kvack.org (Postfix) with ESMTP id 0F9456B0277 for ; Wed, 30 Sep 2015 12:03:09 -0400 (EDT) Received: by igcrk20 with SMTP id rk20so105754716igc.1 for ; Wed, 30 Sep 2015 09:03:08 -0700 (PDT) Received: from resqmta-ch2-12v.sys.comcast.net (resqmta-ch2-12v.sys.comcast.net. [2001:558:fe21:29:69:252:207:44]) by mx.google.com with ESMTPS id p19si21286872igs.100.2015.09.30.09.03.07 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 30 Sep 2015 09:03:08 -0700 (PDT) Date: Wed, 30 Sep 2015 11:03:06 -0500 (CDT) From: Christoph Lameter Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists In-Reply-To: <20150930114255.13505.2618.stgit@canyon> Message-ID: References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes On Wed, 30 Sep 2015, Jesper Dangaard Brouer wrote: > Make it possible to free a freelist with several objects by adjusting > API of slab_free() and __slab_free() to have head, tail and an objects > counter (cnt). Acked-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f178.google.com (mail-qk0-f178.google.com [209.85.220.178]) by kanga.kvack.org (Postfix) with ESMTP id 0D06582F87 for ; Thu, 1 Oct 2015 18:10:18 -0400 (EDT) Received: by qkcf65 with SMTP id f65so35697409qkc.3 for ; Thu, 01 Oct 2015 15:10:17 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id a12si7582230qkj.82.2015.10.01.15.10.17 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 01 Oct 2015 15:10:17 -0700 (PDT) Date: Thu, 1 Oct 2015 15:10:15 -0700 From: Andrew Morton Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-Id: <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> In-Reply-To: <20150930114255.13505.2618.stgit@canyon> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer wrote: > Make it possible to free a freelist with several objects by adjusting > API of slab_free() and __slab_free() to have head, tail and an objects > counter (cnt). > > Tail being NULL indicate single object free of head object. This > allow compiler inline constant propagation in slab_free() and > slab_free_freelist_hook() to avoid adding any overhead in case of > single object free. > > This allows a freelist with several objects (all within the same > slab-page) to be free'ed using a single locked cmpxchg_double in > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > Object debugging on the free path is also extended to handle these > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > objects don't belong to the same slab-page. > > These changes are needed for the next patch to bulk free the detached > freelists it introduces and constructs. > > Micro benchmarking showed no performance reduction due to this change, > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > checkpatch says WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON() #205: FILE: mm/slub.c:2888: + BUG_ON(!size); Linus will get mad at you if he finds out, and we wouldn't want that. --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix +++ a/mm/slub.c @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc /* Note that interrupts must be enabled when calling this function. */ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) { - BUG_ON(!size); + if (WARN_ON(!size)) + return; do { struct detached_freelist df; _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f53.google.com (mail-qg0-f53.google.com [209.85.192.53]) by kanga.kvack.org (Postfix) with ESMTP id B149E82F99 for ; Fri, 2 Oct 2015 05:41:24 -0400 (EDT) Received: by qgev79 with SMTP id v79so89698337qge.0 for ; Fri, 02 Oct 2015 02:41:24 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q72si9566877qkl.121.2015.10.02.02.41.23 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Oct 2015 02:41:23 -0700 (PDT) Date: Fri, 2 Oct 2015 11:41:18 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151002114118.75aae2f9@redhat.com> In-Reply-To: <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , brouer@redhat.com On Thu, 1 Oct 2015 15:10:15 -0700 Andrew Morton wrote: > On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer wrote: > > > Make it possible to free a freelist with several objects by adjusting > > API of slab_free() and __slab_free() to have head, tail and an objects > > counter (cnt). > > > > Tail being NULL indicate single object free of head object. This > > allow compiler inline constant propagation in slab_free() and > > slab_free_freelist_hook() to avoid adding any overhead in case of > > single object free. > > > > This allows a freelist with several objects (all within the same > > slab-page) to be free'ed using a single locked cmpxchg_double in > > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > > > Object debugging on the free path is also extended to handle these > > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > > objects don't belong to the same slab-page. > > > > These changes are needed for the next patch to bulk free the detached > > freelists it introduces and constructs. > > > > Micro benchmarking showed no performance reduction due to this change, > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > > > > checkpatch says > > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON() > #205: FILE: mm/slub.c:2888: > + BUG_ON(!size); > > > Linus will get mad at you if he finds out, and we wouldn't want that. > > --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix > +++ a/mm/slub.c > @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc > /* Note that interrupts must be enabled when calling this function. */ > void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > { > - BUG_ON(!size); > + if (WARN_ON(!size)) > + return; > > do { > struct detached_freelist df; > _ My problem with this change is that WARN_ON generates (slightly) larger code size, which is critical for instruction-cache usage... [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17) function old new delta kmem_cache_free_bulk 438 455 +17 My IP-forwarding benchmark is actually a very challenging use-case, because the code path "size" a packet have to travel is larger than the instruction-cache of the CPU. Thus, I need introducing new code like this patch and at the same time have to reduce the number of instruction-cache misses/usage. In this case we solve the problem by kmem_cache_free_bulk() not getting called too often. Thus, +17 bytes will hopefully not matter too much... but on the other hand we sort-of know that calling kmem_cache_free_bulk() will cause icache misses. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f172.google.com (mail-ig0-f172.google.com [209.85.213.172]) by kanga.kvack.org (Postfix) with ESMTP id C7BC16B02A0 for ; Fri, 2 Oct 2015 06:10:04 -0400 (EDT) Received: by igcpb10 with SMTP id pb10so15031130igc.1 for ; Fri, 02 Oct 2015 03:10:04 -0700 (PDT) Received: from resqmta-ch2-11v.sys.comcast.net (resqmta-ch2-11v.sys.comcast.net. [2001:558:fe21:29:69:252:207:43]) by mx.google.com with ESMTPS id b9si5487644igl.46.2015.10.02.03.10.03 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 02 Oct 2015 03:10:03 -0700 (PDT) Date: Fri, 2 Oct 2015 05:10:02 -0500 (CDT) From: Christoph Lameter Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists In-Reply-To: <20151002114118.75aae2f9@redhat.com> Message-ID: References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Andrew Morton , linux-mm@kvack.org, Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote: > Thus, I need introducing new code like this patch and at the same time > have to reduce the number of instruction-cache misses/usage. In this > case we solve the problem by kmem_cache_free_bulk() not getting called > too often. Thus, +17 bytes will hopefully not matter too much... but on > the other hand we sort-of know that calling kmem_cache_free_bulk() will > cause icache misses. Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0 right? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f43.google.com (mail-qg0-f43.google.com [209.85.192.43]) by kanga.kvack.org (Postfix) with ESMTP id E483082F99 for ; Fri, 2 Oct 2015 06:40:07 -0400 (EDT) Received: by qgx61 with SMTP id 61so90814226qgx.3 for ; Fri, 02 Oct 2015 03:40:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id z37si3589927qkg.87.2015.10.02.03.40.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Oct 2015 03:40:07 -0700 (PDT) Date: Fri, 2 Oct 2015 12:40:01 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151002124001.2c1d9d7d@redhat.com> In-Reply-To: References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , linux-mm@kvack.org, Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , brouer@redhat.com On Fri, 2 Oct 2015 05:10:02 -0500 (CDT) Christoph Lameter wrote: > On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote: > > > Thus, I need introducing new code like this patch and at the same time > > have to reduce the number of instruction-cache misses/usage. In this > > case we solve the problem by kmem_cache_free_bulk() not getting called > > too often. Thus, +17 bytes will hopefully not matter too much... but on > > the other hand we sort-of know that calling kmem_cache_free_bulk() will > > cause icache misses. > > Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0 > right? I think we crash if size == 0, as we deref p[--size] in build_detached_freelist(). (aside note: The code do handles if pointers in the p array are NULL) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f169.google.com (mail-qk0-f169.google.com [209.85.220.169]) by kanga.kvack.org (Postfix) with ESMTP id DC8CA6B0297 for ; Fri, 2 Oct 2015 09:40:48 -0400 (EDT) Received: by qkbi190 with SMTP id i190so22379368qkb.1 for ; Fri, 02 Oct 2015 06:40:48 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id f47si10317454qge.78.2015.10.02.06.40.47 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Oct 2015 06:40:47 -0700 (PDT) Date: Fri, 2 Oct 2015 15:40:39 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151002154039.69f82bdc@redhat.com> In-Reply-To: <20151002114118.75aae2f9@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , brouer@redhat.com, Hannes Frederic Sowa On Fri, 2 Oct 2015 11:41:18 +0200 Jesper Dangaard Brouer wrote: > On Thu, 1 Oct 2015 15:10:15 -0700 > Andrew Morton wrote: > > > On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer wrote: > > > > > Make it possible to free a freelist with several objects by adjusting > > > API of slab_free() and __slab_free() to have head, tail and an objects > > > counter (cnt). > > > > > > Tail being NULL indicate single object free of head object. This > > > allow compiler inline constant propagation in slab_free() and > > > slab_free_freelist_hook() to avoid adding any overhead in case of > > > single object free. > > > > > > This allows a freelist with several objects (all within the same > > > slab-page) to be free'ed using a single locked cmpxchg_double in > > > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > > > > > Object debugging on the free path is also extended to handle these > > > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > > > objects don't belong to the same slab-page. > > > > > > These changes are needed for the next patch to bulk free the detached > > > freelists it introduces and constructs. > > > > > > Micro benchmarking showed no performance reduction due to this change, > > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > > > > > > > checkpatch says > > > > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON() > > #205: FILE: mm/slub.c:2888: > > + BUG_ON(!size); > > > > > > Linus will get mad at you if he finds out, and we wouldn't want that. > > > > --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix > > +++ a/mm/slub.c > > @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc > > /* Note that interrupts must be enabled when calling this function. */ > > void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > > { > > - BUG_ON(!size); > > + if (WARN_ON(!size)) > > + return; > > > > do { > > struct detached_freelist df; > > _ > > My problem with this change is that WARN_ON generates (slightly) larger > code size, which is critical for instruction-cache usage... > > [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON > add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17) > function old new delta > kmem_cache_free_bulk 438 455 +17 > > My IP-forwarding benchmark is actually a very challenging use-case, > because the code path "size" a packet have to travel is larger than the > instruction-cache of the CPU. > > Thus, I need introducing new code like this patch and at the same time > have to reduce the number of instruction-cache misses/usage. In this > case we solve the problem by kmem_cache_free_bulk() not getting called > too often. Thus, +17 bytes will hopefully not matter too much... but on > the other hand we sort-of know that calling kmem_cache_free_bulk() will > cause icache misses. I just tested this change on top of my net-use-case patchset... and for some strange reason the code with this WARN_ON is faster and have much less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses). Thus, I think we should keep your fix. I cannot explain why using WARN_ON() is better and cause less icache misses. And I hate when I don't understand every detail. My theory is, after reading the assembler code, that the UD2 instruction (from BUG_ON) cause some kind of icache decoder stall (Intel experts???). Now that should not be a problem, as UD2 is obviously placed as an unlikely branch and left at the end of the asm function call. But the call to __slab_free() is also placed at the end of the asm function (gets inlined from slab_free() as unlikely). And it is actually fairly likely that bulking is calling __slab_free (slub slowpath call). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f177.google.com (mail-qk0-f177.google.com [209.85.220.177]) by kanga.kvack.org (Postfix) with ESMTP id B7E3C4402FE for ; Fri, 2 Oct 2015 17:50:47 -0400 (EDT) Received: by qkas79 with SMTP id s79so48672845qka.0 for ; Fri, 02 Oct 2015 14:50:47 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id 79si12127977qgc.83.2015.10.02.14.50.46 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Oct 2015 14:50:46 -0700 (PDT) Date: Fri, 2 Oct 2015 14:50:44 -0700 From: Andrew Morton Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-Id: <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> In-Reply-To: <20151002154039.69f82bdc@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Hannes Frederic Sowa On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer wrote: > > Thus, I need introducing new code like this patch and at the same time > > have to reduce the number of instruction-cache misses/usage. In this > > case we solve the problem by kmem_cache_free_bulk() not getting called > > too often. Thus, +17 bytes will hopefully not matter too much... but on > > the other hand we sort-of know that calling kmem_cache_free_bulk() will > > cause icache misses. > > I just tested this change on top of my net-use-case patchset... and for > some strange reason the code with this WARN_ON is faster and have much > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses). > > Thus, I think we should keep your fix. > > I cannot explain why using WARN_ON() is better and cause less icache > misses. And I hate when I don't understand every detail. > > My theory is, after reading the assembler code, that the UD2 > instruction (from BUG_ON) cause some kind of icache decoder stall > (Intel experts???). Now that should not be a problem, as UD2 is > obviously placed as an unlikely branch and left at the end of the asm > function call. But the call to __slab_free() is also placed at the end > of the asm function (gets inlined from slab_free() as unlikely). And > it is actually fairly likely that bulking is calling __slab_free (slub > slowpath call). Yes, I was looking at the asm code and the difference is pretty small: a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly. But I wouldn't assume that the microbenchmarking is meaningful. I've seen shockingly large (and quite repeatable) microbenchmarking differences from small changes in code which isn't even executed (and this is one such case, actually). You add or remove just one byte of text and half the kernel (or half the .o file?) gets a different alignment and this seems to change everything. Deleting the BUG altogether sounds the best solution. As long as the kernel crashes in some manner, we'll be able to work out what happened. And it's cant-happen anyway, isn't it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 0F01B82F66 for ; Mon, 5 Oct 2015 15:26:46 -0400 (EDT) Received: by pacfv12 with SMTP id fv12so188815839pac.2 for ; Mon, 05 Oct 2015 12:26:45 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id kg6si18345799pad.12.2015.10.05.12.26.44 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Oct 2015 12:26:45 -0700 (PDT) Date: Mon, 5 Oct 2015 21:26:39 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151005212639.35932b6c@redhat.com> In-Reply-To: <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Hannes Frederic Sowa , brouer@redhat.com, Andi Kleen , Arnaldo Carvalho de Melo On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton wrote: > On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer wrote: >=20 > > > Thus, I need introducing new code like this patch and at the same time > > > have to reduce the number of instruction-cache misses/usage. In this > > > case we solve the problem by kmem_cache_free_bulk() not getting called > > > too often. Thus, +17 bytes will hopefully not matter too much... but = on > > > the other hand we sort-of know that calling kmem_cache_free_bulk() wi= ll > > > cause icache misses. > >=20 > > I just tested this change on top of my net-use-case patchset... and for > > some strange reason the code with this WARN_ON is faster and have much > > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses). > >=20 > > Thus, I think we should keep your fix. > >=20 > > I cannot explain why using WARN_ON() is better and cause less icache > > misses. And I hate when I don't understand every detail. > >=20 > > My theory is, after reading the assembler code, that the UD2 > > instruction (from BUG_ON) cause some kind of icache decoder stall > > (Intel experts???). Now that should not be a problem, as UD2 is > > obviously placed as an unlikely branch and left at the end of the asm > > function call. But the call to __slab_free() is also placed at the end > > of the asm function (gets inlined from slab_free() as unlikely). And > > it is actually fairly likely that bulking is calling __slab_free (slub > > slowpath call). >=20 > Yes, I was looking at the asm code and the difference is pretty small: > a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly. Actually managed to find documentation that the UD2 instruction do stops the instruction prefetcher. =46rom "Intel=C2=AE 64 and IA-32 Architectures Optimization Reference Manual" Section 3.4.1.6 "Branch Type Selection": Cite: "[...] follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path." And from LWN article: https://lwn.net/Articles/255364/ Cite: "[...] like using the ud2 instruction [...]. This instruction, which cannot be executed itself, is used after an indirect jump instruction; it is used as a signal to the instruction fetcher that the processor should not waste efforts decoding the following memory since the execution will continue at a different location." Thus, we were hitting a very unfortunately combination of code, there the UD2 instruction was located before the call to __slab_free() in the ASM code, causing instruction prefetching to fail/stop. Now I understand all the details again :-) My only problem left, is I want a perf measurement that pinpoint these kind of spots. The difference in L1-icache-load-misses were significant (1,278,276 vs 2,719,158). I tried to somehow perf record this with different perf events without being able to pinpoint the location (even though I know the spot now). Even tried Andi's ocperf.py... maybe he will know what event I should try? > But I wouldn't assume that the microbenchmarking is meaningful. I've > seen shockingly large (and quite repeatable) microbenchmarking > differences from small changes in code which isn't even executed (and > this is one such case, actually). You add or remove just one byte of > text and half the kernel (or half the .o file?) gets a different > alignment and this seems to change everything. I do know alignment of code is important, but this performance impact was just too large. And I think I found documentation to back my theory. =20 > Deleting the BUG altogether sounds the best solution. As long as the > kernel crashes in some manner, we'll be able to work out what happened. > And it's cant-happen anyway, isn't it? To me WARN_ON() seems like a good "documentation" if it does not hurt performance. I don't think removing the WARN_ON() will improve performance, but I'm willing to actually test if it matters. Below is how it will crash, it is fairly obvious (if you know x86_64 call convention).=20 API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void *= *p) Below it clearly shows second arg "size" in %RSI =3D=3D 0. [ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4= d58 [ 2986.977760] IP: [] kmem_cache_free_bulk+0x54/0x1b0 [ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161 [ 2986.977879] Oops: 0003 [#1] SMP=20 [ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netcon= sole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcsp= kr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd gr= ace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e10= 00e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03] [ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G O 4= .3.0-rc2-net-next-mm+ #442 [ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M.= /Z97 Extreme4, BIOS P2.10 05/12/2015 [ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff880= 3e3c50000 [ 2986.978414] RIP: 0010:[] [] kmem_ca= che_free_bulk+0x54/0x1b0 [ 2986.978490] RSP: 0018:ffff8803e3c53aa8 EFLAGS: 00010282 [ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 00000001002= 00016 [ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff800= 00000 [ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3d= b3b01 [ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: fffffffffff= fffff [ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d7= 74c00 [ 2986.978740] FS: 00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:= 0000000000000000 [ 2986.978814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001= 406e0 [ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000= 00000 [ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 00000000000= 00400 [ 2986.978979] Stack: [ 2986.979014] ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffff= a002e000 [ 2986.979099] ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 00000000= 00000000 [ 2986.979184] 0000000000000000 0000000000000000 0000000000000000 00000000= 00000000 [ 2986.979259] Call Trace: [ 2986.979296] [] ? 0xffffffffa002e000 [ 2986.979335] [] run_try_crash_tests+0x378/0x3a6 [slab_= bulk_test03] [ 2986.979409] [] slab_bulk_test03_module_init+0x12d/0x1= 000 [slab_bulk_test03] [ 2986.979485] [] do_one_initcall+0xad/0x1d0 [ 2986.979526] [] do_init_module+0x60/0x1e8 [ 2986.979567] [] load_module+0x1bb8/0x2420 [ 2986.979608] [] ? __symbol_put+0x40/0x40 [ 2986.979647] [] SyS_finit_module+0x80/0xb0 [ 2986.979690] [] entry_SYSCALL_64_fastpath+0x12/0x6a [ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 = 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 = <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4=20 [ 2986.979904] RIP [] kmem_cache_free_bulk+0x54/0x1b0 [ 2986.979945] RSP [ 2986.979982] CR2: ffffffffa03c4d58 [ 2986.980130] ---[ end trace 64f42b02f4347220 ]--- (gdb) list *(kmem_cache_free_bulk)+0x54 0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269). 264 return p; 265 } 266=09 267 static inline void set_freepointer(struct kmem_cache *s, void *object, = void *fp) 268 { 269 *(void **)(object + s->offset) =3D fp; 270 } 271=09 272 /* Loop over all objects in a slab */ 273 #define for_each_object(__p, __s, __addr, __objects) \ --=20 Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id 4CC7282FA8 for ; Mon, 5 Oct 2015 17:20:47 -0400 (EDT) Received: by pablk4 with SMTP id lk4so186525333pab.3 for ; Mon, 05 Oct 2015 14:20:47 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id kb6si43134417pbc.7.2015.10.05.14.20.46 for ; Mon, 05 Oct 2015 14:20:46 -0700 (PDT) Date: Mon, 5 Oct 2015 14:20:45 -0700 From: Andi Kleen Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151005212045.GG26924@tassilo.jf.intel.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151005212639.35932b6c@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Andrew Morton , linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Hannes Frederic Sowa , Arnaldo Carvalho de Melo > My only problem left, is I want a perf measurement that pinpoint these > kind of spots. The difference in L1-icache-load-misses were significant > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > different perf events without being able to pinpoint the location (even > though I know the spot now). Even tried Andi's ocperf.py... maybe he > will know what event I should try? Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the bottle neck is and what to sample for if there is a suitable event and even prints the command line. https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev However frontend issues are difficult to sample, as they happen very far away from instruction retirement where the sampling happens. So you may have large skid and the sampling points may be far away. Skylake has new special FRONTEND_* PEBS events for this, but before it was often difficult. BTW if your main goal is icache; I wrote a gcc patch to help the kernel by enabling function splitting: Apply the patch in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5, make sure 9bebe9e5b0f (now in mainline) is applied and build with -freorder-blocks-and-partition. That will split all functions into statically predicted hot and cold parts and generally relieves icache pressure. Any testing of this on your workload welcome. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id D1C9082F66 for ; Mon, 5 Oct 2015 19:07:08 -0400 (EDT) Received: by pacfv12 with SMTP id fv12so193389661pac.2 for ; Mon, 05 Oct 2015 16:07:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id w14si43708970pbt.201.2015.10.05.16.07.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Oct 2015 16:07:07 -0700 (PDT) Date: Tue, 6 Oct 2015 01:07:03 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151006010703.09e2f0ff@redhat.com> In-Reply-To: <20151005212045.GG26924@tassilo.jf.intel.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andi Kleen Cc: linux-mm@kvack.org, Christoph Lameter , netdev@vger.kernel.org, Arnaldo Carvalho de Melo , brouer@redhat.com (trimmed Cc list a little) On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen wrote: > > My only problem left, is I want a perf measurement that pinpoint these > > kind of spots. The difference in L1-icache-load-misses were significant > > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > > different perf events without being able to pinpoint the location (even > > though I know the spot now). Even tried Andi's ocperf.py... maybe he > > will know what event I should try? > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the > bottle neck is and what to sample for if there is a suitable event and > even prints the command line. > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev > My result from (IP-forward flow hitting CPU 0): $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0 So, what does this tell me?: C0 BAD Bad_Speculation: 0.00 % [ 5.50%] C0 BE Backend_Bound: 100.00 % [ 5.50%] C0 BE/Mem Backend_Bound.Memory_Bound: 53.06 % [ 5.50%] C0 BE/Core Backend_Bound.Core_Bound: 46.94 % [ 5.50%] C0-T0 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [ 5.50%] C0-T0 BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 54.51 % [ 5.50%] C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 20.99 % [ 5.60%] C0-T0 CPU utilization: 1.00 CPUs [100.00%] C0-T1 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [ 5.50%] C0-T1 CPU utilization: 1.00 CPUs [100.00%] Unfortunately the perf command it gives me fails with: "invalid or unsupported event". Perf command: perf record -g -e cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp -C 0,4 -a > However frontend issues are difficult to sample, as they happen very far > away from instruction retirement where the sampling happens. So you may > have large skid and the sampling points may be far away. Skylake has new > special FRONTEND_* PEBS events for this, but before it was often difficult. This testlab CPU is i7-4790K @ 4.00GHz. Maybe I should get a Skylake... p.s. thanks for your pmu-tools[1], even-though I don't know how to use most of them ;-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [1] https://github.com/andikleen/pmu-tools -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 6227382FAD for ; Mon, 5 Oct 2015 19:53:42 -0400 (EDT) Received: by pacfv12 with SMTP id fv12so194391963pac.2 for ; Mon, 05 Oct 2015 16:53:42 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id id10si44011398pbc.42.2015.10.05.16.53.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Oct 2015 16:53:41 -0700 (PDT) Date: Tue, 6 Oct 2015 01:53:37 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151006015337.0fab1547@redhat.com> In-Reply-To: <20151005212639.35932b6c@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: Cc: linux-mm@kvack.org, Christoph Lameter , netdev@vger.kernel.org, Andi Kleen , Arnaldo Carvalho de Melo , brouer@redhat.com On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer wrote: > My only problem left, is I want a perf measurement that pinpoint these > kind of spots. The difference in L1-icache-load-misses were significant > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > different perf events without being able to pinpoint the location (even > though I know the spot now). Even tried Andi's ocperf.py... maybe he > will know what event I should try? Using: 'ocperf.py -e icache_misses' and looking closer at the perf annotate and considering "skid" I think I can see the icache misses happening in the end of the function, due to the UD2 inst. Annotation of kmem_cache_free_bulk (last/end of func): =E2=94=8217b: test %r12,%r12 =E2=94=82 =E2=86=91 jne 2e =E2=94=82184: pop %rbx =E2=94=82 pop %r12 =E2=94=82 pop %r13 =E2=94=82 pop %r14 =E2=94=82 pop %r15 =E2=94=82 pop %rbp =E2=94=82 =E2=86=90 retq 8.57 =E2=94=8218f: mov 0x30(%rdx),%rdx 5.71 =E2=94=82 =E2=86=91 jmp 116 =E2=94=82195: ud2 2.86 =E2=94=82197: mov %rdi,%rsi =E2=94=82 mov %r11d,%r8d =E2=94=82 mov %r10,%rcx =E2=94=82 mov %rbx,%rdx =E2=94=82 mov %r15,%rdi =E2=94=82 =E2=86=92 callq __slab_free =E2=94=82 =E2=86=91 jmp 17b 2.86 =E2=94=821ad: mov 0x30(%rdi),%rdi =E2=94=82 =E2=86=91 jmpq 99 --=20 Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f181.google.com (mail-ig0-f181.google.com [209.85.213.181]) by kanga.kvack.org (Postfix) with ESMTP id E7B396B0038 for ; Wed, 7 Oct 2015 06:39:24 -0400 (EDT) Received: by igcrk20 with SMTP id rk20so105692381igc.1 for ; Wed, 07 Oct 2015 03:39:24 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 67si26799746iog.210.2015.10.07.03.39.24 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 07 Oct 2015 03:39:24 -0700 (PDT) Date: Wed, 7 Oct 2015 12:39:19 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151007123919.27a9c823@redhat.com> In-Reply-To: <20151005212639.35932b6c@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , netdev@vger.kernel.org, Arnaldo Carvalho de Melo , brouer@redhat.com On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer wrote: > On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton wrote: > [...] > > > Deleting the BUG altogether sounds the best solution. As long as the > > kernel crashes in some manner, we'll be able to work out what happened. > > And it's cant-happen anyway, isn't it? > > To me WARN_ON() seems like a good "documentation" if it does not hurt > performance. I don't think removing the WARN_ON() will improve > performance, but I'm willing to actually test if it matters. I tested removing BUG/WARN_ON altogether, and it gives slightly worse performance. The icache-misses only increase approx 14% (not 112% as before). This, I'm willing to attribute to some code alignment issue. Thus, let us just keep the WARN_ON() and move along. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f175.google.com (mail-ig0-f175.google.com [209.85.213.175]) by kanga.kvack.org (Postfix) with ESMTP id 725436B0038 for ; Wed, 7 Oct 2015 08:31:25 -0400 (EDT) Received: by igxx6 with SMTP id x6so13100250igx.1 for ; Wed, 07 Oct 2015 05:31:25 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id x9si2262386igl.12.2015.10.07.05.31.24 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 07 Oct 2015 05:31:24 -0700 (PDT) Date: Wed, 7 Oct 2015 14:31:20 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151007143120.7068416d@redhat.com> In-Reply-To: <20151006010703.09e2f0ff@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> <20151006010703.09e2f0ff@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andi Kleen Cc: linux-mm@kvack.org, netdev@vger.kernel.org, Arnaldo Carvalho de Melo , brouer@redhat.com On Tue, 6 Oct 2015 01:07:03 +0200 Jesper Dangaard Brouer wrote: > (trimmed Cc list a little) > > On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen wrote: > > > > My only problem left, is I want a perf measurement that pinpoint these > > > kind of spots. The difference in L1-icache-load-misses were significant > > > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > > > different perf events without being able to pinpoint the location (even > > > though I know the spot now). Even tried Andi's ocperf.py... maybe he > > > will know what event I should try? > > > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the > > bottle neck is and what to sample for if there is a suitable event and > > even prints the command line. > > > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev > > > > My result from (IP-forward flow hitting CPU 0): > $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0 > > So, what does this tell me?: > > C0 BAD Bad_Speculation: 0.00 % [ 5.50%] > C0 BE Backend_Bound: 100.00 % [ 5.50%] > C0 BE/Mem Backend_Bound.Memory_Bound: 53.06 % [ 5.50%] > C0 BE/Core Backend_Bound.Core_Bound: 46.94 % [ 5.50%] > C0-T0 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [ 5.50%] > C0-T0 BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 54.51 % [ 5.50%] > C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 20.99 % [ 5.60%] > C0-T0 CPU utilization: 1.00 CPUs [100.00%] > C0-T1 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [ 5.50%] > C0-T1 CPU utilization: 1.00 CPUs [100.00%] Reading: https://github.com/andikleen/pmu-tools/wiki/toplev-manual Helped me understand most of above. My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have limited "Frontend" support. E.g. # perf record -g -a -e stalled-cycles-frontend Error: The stalled-cycles-frontend event is not supported. And AFAIK icache misses are part of "frontend". > Unfortunately the perf command it gives me fails with: > "invalid or unsupported event". > > Perf command: > > sudo ./ocperf.py record -g -e \ cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\ cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\ cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\ cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \ -C 0,4 -a I fixed the problem with this perf command by removing the ":pp" part. Perhaps your tool need to fix that? A working command line looks like this: sudo ./ocperf.py record -g -e \ cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\ cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\ cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\ cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \ -C 0,4 -a -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id 5AD6F6B0038 for ; Wed, 7 Oct 2015 09:36:24 -0400 (EDT) Received: by padhy16 with SMTP id hy16so22012426pad.1 for ; Wed, 07 Oct 2015 06:36:24 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id vb4si18174790pbc.53.2015.10.07.06.36.23 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 07 Oct 2015 06:36:23 -0700 (PDT) Date: Wed, 7 Oct 2015 10:36:19 -0300 From: Arnaldo Carvalho de Melo Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151007133619.GR12682@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> <20151006010703.09e2f0ff@redhat.com> <20151007143120.7068416d@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151007143120.7068416d@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: Andi Kleen , linux-mm@kvack.org, netdev@vger.kernel.org Em Wed, Oct 07, 2015 at 02:31:20PM +0200, Jesper Dangaard Brouer escreveu: > On Tue, 6 Oct 2015 01:07:03 +0200 > Jesper Dangaard Brouer wrote: > > (trimmed Cc list a little) > > > > On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen wrote: > > > > > > My only problem left, is I want a perf measurement that pinpoint these > > > > kind of spots. The difference in L1-icache-load-misses were significant > > > > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > > > > different perf events without being able to pinpoint the location (even > > > > though I know the spot now). Even tried Andi's ocperf.py... maybe he > > > > will know what event I should try? > > > > > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the > > > bottle neck is and what to sample for if there is a suitable event and > > > even prints the command line. > > > > > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev > > > > > > > My result from (IP-forward flow hitting CPU 0): > > $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0 > > > > So, what does this tell me?: > > > > C0 BAD Bad_Speculation: 0.00 % [ 5.50%] > > C0 BE Backend_Bound: 100.00 % [ 5.50%] > > C0 BE/Mem Backend_Bound.Memory_Bound: 53.06 % [ 5.50%] > > C0 BE/Core Backend_Bound.Core_Bound: 46.94 % [ 5.50%] > > C0-T0 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [ 5.50%] > > C0-T0 BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 54.51 % [ 5.50%] > > C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 20.99 % [ 5.60%] > > C0-T0 CPU utilization: 1.00 CPUs [100.00%] > > C0-T1 FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [ 5.50%] > > C0-T1 CPU utilization: 1.00 CPUs [100.00%] > > Reading: https://github.com/andikleen/pmu-tools/wiki/toplev-manual > Helped me understand most of above. > > My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have > limited "Frontend" support. E.g. > > # perf record -g -a -e stalled-cycles-frontend > Error: > The stalled-cycles-frontend event is not supported. > > And AFAIK icache misses are part of "frontend". > > > > Unfortunately the perf command it gives me fails with: > > "invalid or unsupported event". > > > > Perf command: > > > > sudo ./ocperf.py record -g -e \ > cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\ > cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\ > cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\ > cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \ > -C 0,4 -a > > I fixed the problem with this perf command by removing the ":pp" part. > Perhaps your tool need to fix that? > > A working command line looks like this: > > sudo ./ocperf.py record -g -e \ > cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\ > cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\ > cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\ > cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \ > -C 0,4 -a There is a recent patch that may help here, see below, but maybe its just a matter of removing that :pp, as it ends with a /pp anyway, no need to state that twice :) With the patch below all those /pp would be replaced with /P. - Arnaldo https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/tools/perf?id=7f94af7a489fada17d28cc60e8f4409ce216bd6d ---------------------------------------------------------------------- perf tools: Introduce 'P' modifier to request max precision The 'P' will cause the event to get maximum possible detected precise level. Following record: $ perf record -e cycles:P ... will detect maximum precise level for 'cycles' event and use it. ---------------------------------------------------------------------- - Arnaldo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id CB31F6B0254 for ; Wed, 7 Oct 2015 11:45:03 -0400 (EDT) Received: by pablk4 with SMTP id lk4so24841263pab.3 for ; Wed, 07 Oct 2015 08:45:03 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id pz3si58673332pbb.48.2015.10.07.08.45.02 for ; Wed, 07 Oct 2015 08:45:03 -0700 (PDT) Date: Wed, 7 Oct 2015 08:44:56 -0700 From: Andi Kleen Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151007154456.GJ26924@tassilo.jf.intel.com> References: <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> <20151006010703.09e2f0ff@redhat.com> <20151007143120.7068416d@redhat.com> <20151007133619.GR12682@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151007133619.GR12682@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Arnaldo Carvalho de Melo Cc: Jesper Dangaard Brouer , linux-mm@kvack.org, netdev@vger.kernel.org > There is a recent patch that may help here, see below, but maybe its > just a matter of removing that :pp, as it ends with a /pp anyway, no > need to state that twice :) Yes the extra :pp was a regression in toplev. I fixed it now. Thanks. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id 815856B0038 for ; Wed, 7 Oct 2015 12:06:56 -0400 (EDT) Received: by padhy16 with SMTP id hy16so25378323pad.1 for ; Wed, 07 Oct 2015 09:06:56 -0700 (PDT) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id qp7si58800410pbc.93.2015.10.07.09.06.55 for ; Wed, 07 Oct 2015 09:06:55 -0700 (PDT) Date: Wed, 7 Oct 2015 09:06:52 -0700 From: Andi Kleen Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Message-ID: <20151007160652.GK26924@tassilo.jf.intel.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> <20151006010703.09e2f0ff@redhat.com> <20151007143120.7068416d@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151007143120.7068416d@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, netdev@vger.kernel.org, Arnaldo Carvalho de Melo > My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have > limited "Frontend" support. E.g. > > # perf record -g -a -e stalled-cycles-frontend > Error: > The stalled-cycles-frontend event is not supported. > > And AFAIK icache misses are part of "frontend". Ignore stalled-cycles-frontend. It is very unreliable and has never worked right. toplev gives you much more reliable output. -Andi -- ak@linux.intel.com -- Speaking for myself only -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 8CA366B0253 for ; Wed, 14 Oct 2015 01:14:42 -0400 (EDT) Received: by pacex6 with SMTP id ex6so42967172pac.3 for ; Tue, 13 Oct 2015 22:14:42 -0700 (PDT) Received: from lgeamrelo13.lge.com (LGEAMRELO13.lge.com. [156.147.23.53]) by mx.google.com with ESMTPS id co2si10362862pbc.217.2015.10.13.22.14.40 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 13 Oct 2015 22:14:41 -0700 (PDT) Date: Wed, 14 Oct 2015 14:15:25 +0900 From: Joonsoo Kim Subject: Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Message-ID: <20151014051524.GA29286@js1304-P5Q-DELUXE> References: <20150929154605.14465.98995.stgit@canyon> <20150929154822.14465.50207.stgit@canyon> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150929154822.14465.50207.stgit@canyon> Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote: > This change focus on improving the speed of object freeing in the > "slowpath" of kmem_cache_free_bulk. > > The calls slab_free (fastpath) and __slab_free (slowpath) have been > extended with support for bulk free, which amortize the overhead of > the (locked) cmpxchg_double. > > To use the new bulking feature, we build what I call a detached > freelist. The detached freelist takes advantage of three properties: > > 1) the free function call owns the object that is about to be freed, > thus writing into this memory is synchronization-free. > > 2) many freelist's can co-exist side-by-side in the same slab-page > each with a separate head pointer. > > 3) it is the visibility of the head pointer that needs synchronization. > > Given these properties, the brilliant part is that the detached > freelist can be constructed without any need for synchronization. The > freelist is constructed directly in the page objects, without any > synchronization needed. The detached freelist is allocated on the > stack of the function call kmem_cache_free_bulk. Thus, the freelist > head pointer is not visible to other CPUs. > > All objects in a SLUB freelist must belong to the same slab-page. > Thus, constructing the detached freelist is about matching objects > that belong to the same slab-page. The bulk free array is scanned is > a progressive manor with a limited look-ahead facility. > > Kmem debug support is handled in call of slab_free(). > > Notice kmem_cache_free_bulk no longer need to disable IRQs. This > only slowed down single free bulk with approx 3 cycles. > > > Performance data: > Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz > > SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns > > To get stable and comparable numbers, the kernel have been booted with > "slab_merge" (this also improve performance for larger bulk sizes). > > Performance data, compared against fallback bulking: > > bulk - fallback bulk - improvement with this patch > 1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0% > 2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5% > 3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6% > 4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5% > 8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0% > 16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3% > 30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3% > 32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0% > 34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0% > 48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7% > 64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0% > 128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0% > 158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7% > 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4% > > Performance data, compared current in-kernel bulking: > > bulk - curr in-kernel - improvement with this patch > 1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5% > 2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1% > 3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5% > 4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1% > 8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9% > 16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6% > 30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% > 32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% > 34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5% > 48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0% > 64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2% > 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9% > 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0% > 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0% > > Performance with normal SLUB merging is significantly slower for > larger bulking. This is believed to (primarily) be an effect of not > having to share the per-CPU data-structures, as tuning per-CPU size > can achieve similar performance. > > bulk - slab_nomerge - normal SLUB merge > 1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0 > 2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0 > 3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0 > 4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0 > 8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0 > 16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0 > 30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5 > 32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4 > 34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1 > 48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1 > 64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28 > 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30 > 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29 > 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19 > > Joint work with Alexander Duyck. > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c > > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Alexander Duyck > Acked-by: Christoph Lameter > --- > mm/slub.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 78 insertions(+), 30 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 7c2abc33fd4e..53500f3b70ab 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2812,44 +2812,92 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > } > EXPORT_SYMBOL(kmem_cache_free); > > -/* Note that interrupts must be enabled when calling this function. */ > -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > -{ > - struct kmem_cache_cpu *c; > +struct detached_freelist { > struct page *page; > - int i; > + void *tail; > + void *freelist; > + int cnt; > +}; > > - local_irq_disable(); > - c = this_cpu_ptr(s->cpu_slab); > +/* > + * This function progressively scans the array with free objects (with > + * a limited look ahead) and extract objects belonging to the same > + * page. It builds a detached freelist directly within the given > + * page/objects. This can happen without any need for > + * synchronization, because the objects are owned by running process. > + * The freelist is build up as a single linked list in the objects. > + * The idea is, that this detached freelist can then be bulk > + * transferred to the real freelist(s), but only requiring a single > + * synchronization primitive. Look ahead in the array is limited due > + * to performance reasons. > + */ > +static int build_detached_freelist(struct kmem_cache *s, size_t size, > + void **p, struct detached_freelist *df) > +{ > + size_t first_skipped_index = 0; > + int lookahead = 3; > + void *object; > > - for (i = 0; i < size; i++) { > - void *object = p[i]; > + /* Always re-init detached_freelist */ > + df->page = NULL; > > - BUG_ON(!object); > - /* kmem cache debug support */ > - s = cache_from_obj(s, object); > - if (unlikely(!s)) > - goto exit; > - slab_free_hook(s, object); > + do { > + object = p[--size]; > + } while (!object && size); > > - page = virt_to_head_page(object); > + if (!object) > + return 0; > > - if (c->page == page) { > - /* Fastpath: local CPU free */ > - set_freepointer(s, object, c->freelist); > - c->freelist = object; > - } else { > - c->tid = next_tid(c->tid); > - local_irq_enable(); > - /* Slowpath: overhead locked cmpxchg_double_slab */ > - __slab_free(s, page, object, object, 1, _RET_IP_); > - local_irq_disable(); > - c = this_cpu_ptr(s->cpu_slab); > + /* Start new detached freelist */ > + set_freepointer(s, object, NULL); > + df->page = virt_to_head_page(object); > + df->tail = object; > + df->freelist = object; > + p[size] = NULL; /* mark object processed */ > + df->cnt = 1; Hello, Jesper. AFAIK, it is uncommon to clear pointer to object in argument array. At least, it is better to comment it on somewhere. Or, how about removing lookahead facility? Does it have real benefit? Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f170.google.com (mail-ig0-f170.google.com [209.85.213.170]) by kanga.kvack.org (Postfix) with ESMTP id BA3B76B0038 for ; Wed, 21 Oct 2015 03:57:17 -0400 (EDT) Received: by igbkq10 with SMTP id kq10so88868631igb.0 for ; Wed, 21 Oct 2015 00:57:17 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 96si6366476iom.28.2015.10.21.00.57.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 21 Oct 2015 00:57:17 -0700 (PDT) Date: Wed, 21 Oct 2015 09:57:09 +0200 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Message-ID: <20151021095709.167e58d2@redhat.com> In-Reply-To: <20151014051524.GA29286@js1304-P5Q-DELUXE> References: <20150929154605.14465.98995.stgit@canyon> <20150929154822.14465.50207.stgit@canyon> <20151014051524.GA29286@js1304-P5Q-DELUXE> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Joonsoo Kim Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , brouer@redhat.com On Wed, 14 Oct 2015 14:15:25 +0900 Joonsoo Kim wrote: > On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote: > > This change focus on improving the speed of object freeing in the > > "slowpath" of kmem_cache_free_bulk. > > > > The calls slab_free (fastpath) and __slab_free (slowpath) have been > > extended with support for bulk free, which amortize the overhead of > > the (locked) cmpxchg_double. > > > > To use the new bulking feature, we build what I call a detached > > freelist. The detached freelist takes advantage of three properties: > > > > 1) the free function call owns the object that is about to be freed, > > thus writing into this memory is synchronization-free. > > > > 2) many freelist's can co-exist side-by-side in the same slab-page > > each with a separate head pointer. > > > > 3) it is the visibility of the head pointer that needs synchronization. > > > > Given these properties, the brilliant part is that the detached > > freelist can be constructed without any need for synchronization. The > > freelist is constructed directly in the page objects, without any > > synchronization needed. The detached freelist is allocated on the > > stack of the function call kmem_cache_free_bulk. Thus, the freelist > > head pointer is not visible to other CPUs. > > > > All objects in a SLUB freelist must belong to the same slab-page. > > Thus, constructing the detached freelist is about matching objects > > that belong to the same slab-page. The bulk free array is scanned is > > a progressive manor with a limited look-ahead facility. [...] > Hello, Jesper. > > AFAIK, it is uncommon to clear pointer to object in argument array. > At least, it is better to comment it on somewhere. In this case, I think clearing the array is a good thing, as using/referencing objects after they have been free'ed is a bug (which can be hard to detect). > Or, how about removing lookahead facility? Does it have real benefit? In my earlier patch series I had a version with and without lookahead facility. Just so I could benchmark the difference. With Alex'es help we/I tuned the code with the lookahead feature to be just as fast. Thus, I merged the two patches. (Also did testing for worstcase [1]) I do wonder if the lookahead have any real benefit. In micro benchmarking it might be "just-as-fast", but I do suspect (just the code size increase) it can affect real use-cases... Should we remove it? [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id 6515C82F64 for ; Thu, 5 Nov 2015 00:09:18 -0500 (EST) Received: by pasz6 with SMTP id z6so78221502pas.2 for ; Wed, 04 Nov 2015 21:09:18 -0800 (PST) Received: from lgeamrelo13.lge.com (LGEAMRELO13.lge.com. [156.147.23.53]) by mx.google.com with ESMTPS id h4si7286347pat.127.2015.11.04.21.09.16 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 04 Nov 2015 21:09:17 -0800 (PST) Date: Thu, 5 Nov 2015 14:09:25 +0900 From: Joonsoo Kim Subject: Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Message-ID: <20151105050924.GD20374@js1304-P5Q-DELUXE> References: <20150929154605.14465.98995.stgit@canyon> <20150929154822.14465.50207.stgit@canyon> <20151014051524.GA29286@js1304-P5Q-DELUXE> <20151021095709.167e58d2@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151021095709.167e58d2@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jesper Dangaard Brouer Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes On Wed, Oct 21, 2015 at 09:57:09AM +0200, Jesper Dangaard Brouer wrote: > On Wed, 14 Oct 2015 14:15:25 +0900 > Joonsoo Kim wrote: > > > On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote: > > > This change focus on improving the speed of object freeing in the > > > "slowpath" of kmem_cache_free_bulk. > > > > > > The calls slab_free (fastpath) and __slab_free (slowpath) have been > > > extended with support for bulk free, which amortize the overhead of > > > the (locked) cmpxchg_double. > > > > > > To use the new bulking feature, we build what I call a detached > > > freelist. The detached freelist takes advantage of three properties: > > > > > > 1) the free function call owns the object that is about to be freed, > > > thus writing into this memory is synchronization-free. > > > > > > 2) many freelist's can co-exist side-by-side in the same slab-page > > > each with a separate head pointer. > > > > > > 3) it is the visibility of the head pointer that needs synchronization. > > > > > > Given these properties, the brilliant part is that the detached > > > freelist can be constructed without any need for synchronization. The > > > freelist is constructed directly in the page objects, without any > > > synchronization needed. The detached freelist is allocated on the > > > stack of the function call kmem_cache_free_bulk. Thus, the freelist > > > head pointer is not visible to other CPUs. > > > > > > All objects in a SLUB freelist must belong to the same slab-page. > > > Thus, constructing the detached freelist is about matching objects > > > that belong to the same slab-page. The bulk free array is scanned is > > > a progressive manor with a limited look-ahead facility. > [...] > > > > Hello, Jesper. > > > > AFAIK, it is uncommon to clear pointer to object in argument array. > > At least, it is better to comment it on somewhere. > > In this case, I think clearing the array is a good thing, as > using/referencing objects after they have been free'ed is a bug (which > can be hard to detect). Okay. > > > Or, how about removing lookahead facility? Does it have real benefit? > > In my earlier patch series I had a version with and without lookahead > facility. Just so I could benchmark the difference. With Alex'es help > we/I tuned the code with the lookahead feature to be just as fast. > Thus, I merged the two patches. (Also did testing for worstcase [1]) > > I do wonder if the lookahead have any real benefit. In micro > benchmarking it might be "just-as-fast", but I do suspect (just the code > size increase) it can affect real use-cases... Should we remove it? > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c If it's not implemented yet, I would say that starting with simple one first. But, now, we already have well implemented one so we don't need to remove it. :) Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Date: Mon, 28 Sep 2015 14:26:09 +0200 Message-ID: <20150928122609.15409.30765.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:50163 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757134AbbI1M0L (ORCPT ); Mon, 28 Sep 2015 08:26:11 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: From: Christoph Lameter NOTICE: Accepted by AKPM http://ozlabs.org/~akpm/mmots/broken-out/slub-create-new-___slab_alloc-function-that-can-be-called-with-irqs-disabled.patch Bulk alloc needs a function like that because it enables interrupts before calling __slab_alloc which promptly disables them again using the expensive local_irq_save(). Signed-off-by: Christoph Lameter Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 44 +++++++++++++++++++++++++++++--------------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index f614b5dc396b..02cfb3a5983e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2298,23 +2298,15 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * And if we were unable to get a new slab from the partial slab lists then * we need to allocate a new slab. This is the slowest path since it involves * a call to the page allocator and the setup of a new slab. + * + * Version of __slab_alloc to use when we know that interrupts are + * already disabled (which is the case for bulk allocation). */ -static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, +static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c) { void *freelist; struct page *page; - unsigned long flags; - - local_irq_save(flags); -#ifdef CONFIG_PREEMPT - /* - * We may have been preempted and rescheduled on a different - * cpu before disabling interrupts. Need to reload cpu area - * pointer. - */ - c = this_cpu_ptr(s->cpu_slab); -#endif page = c->page; if (!page) @@ -2372,7 +2364,6 @@ load_freelist: VM_BUG_ON(!c->page->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); - local_irq_restore(flags); return freelist; new_slab: @@ -2389,7 +2380,6 @@ new_slab: if (unlikely(!freelist)) { slab_out_of_memory(s, gfpflags, node); - local_irq_restore(flags); return NULL; } @@ -2405,11 +2395,35 @@ new_slab: deactivate_slab(s, page, get_freepointer(s, freelist)); c->page = NULL; c->freelist = NULL; - local_irq_restore(flags); return freelist; } /* + * Another one that disabled interrupt and compensates for possible + * cpu changes by refetching the per cpu area pointer. + */ +static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, + unsigned long addr, struct kmem_cache_cpu *c) +{ + void *p; + unsigned long flags; + + local_irq_save(flags); +#ifdef CONFIG_PREEMPT + /* + * We may have been preempted and rescheduled on a different + * cpu before disabling interrupts. Need to reload cpu area + * pointer. + */ + c = this_cpu_ptr(s->cpu_slab); +#endif + + p = ___slab_alloc(s, gfpflags, node, addr, c); + local_irq_restore(flags); + return p; +} + +/* * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc) * have the fastpath folded into their functions. So no function call * overhead for requests that can be satisfied on the fastpath. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Date: Mon, 28 Sep 2015 14:26:19 +0200 Message-ID: <20150928122619.15409.68763.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:37588 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757134AbbI1M0V (ORCPT ); Mon, 28 Sep 2015 08:26:21 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated #else. For readability mark it with a comment. Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 024eed32da2c..1cf98d89546d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1202,7 +1202,7 @@ unsigned long kmem_cache_flags(unsigned long object_size, return flags; } -#else +#else /* !CONFIG_SLUB_DEBUG */ static inline void setup_object_debug(struct kmem_cache *s, struct page *page, void *object) {} From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 4/7] slab: implement bulking for SLAB allocator Date: Mon, 28 Sep 2015 14:26:24 +0200 Message-ID: <20150928122624.15409.23038.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:51664 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757134AbbI1M00 (ORCPT ); Mon, 28 Sep 2015 08:26:26 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: Implement a basic approach of bulking in the SLAB allocator. Simply use local_irq_{disable,enable} and call single alloc/free in a loop. This simple implementation approach is surprising fast. Notice the normal SLAB fastpath is: 96 cycles (24.119 ns). Below table show that single object bulking only takes 42 cycles. This can be explained by the bulk APIs requirement to be called from a known interrupt context, that is with interrupts enabled. This allow us to avoid the expensive (37 cycles) local_irq_{save,restore}, and instead use the much faster (7 cycles) local_irq_{disable,restore}. Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz: bulk - Current - simple SLAB bulk implementation 1 - 115 cycles(tsc) 28.812 ns - 42 cycles(tsc) 10.715 ns - improved 63.5% 2 - 103 cycles(tsc) 25.956 ns - 27 cycles(tsc) 6.985 ns - improved 73.8% 3 - 101 cycles(tsc) 25.336 ns - 22 cycles(tsc) 5.733 ns - improved 78.2% 4 - 100 cycles(tsc) 25.147 ns - 21 cycles(tsc) 5.319 ns - improved 79.0% 8 - 98 cycles(tsc) 24.616 ns - 18 cycles(tsc) 4.620 ns - improved 81.6% 16 - 97 cycles(tsc) 24.408 ns - 17 cycles(tsc) 4.344 ns - improved 82.5% 30 - 98 cycles(tsc) 24.641 ns - 16 cycles(tsc) 4.202 ns - improved 83.7% 32 - 98 cycles(tsc) 24.607 ns - 16 cycles(tsc) 4.199 ns - improved 83.7% 34 - 98 cycles(tsc) 24.605 ns - 18 cycles(tsc) 4.579 ns - improved 81.6% 48 - 97 cycles(tsc) 24.463 ns - 17 cycles(tsc) 4.405 ns - improved 82.5% 64 - 97 cycles(tsc) 24.370 ns - 17 cycles(tsc) 4.384 ns - improved 82.5% 128 - 99 cycles(tsc) 24.763 ns - 19 cycles(tsc) 4.755 ns - improved 80.8% 158 - 98 cycles(tsc) 24.708 ns - 18 cycles(tsc) 4.723 ns - improved 81.6% 250 - 101 cycles(tsc) 25.342 ns - 20 cycles(tsc) 5.035 ns - improved 80.2% Also notice how well bulking maintains the performance when the bulk size increases (which is a soar spot for the SLUB allocator). Increasing the bulk size further: 20 cycles(tsc) 5.214 ns (bulk: 512) 30 cycles(tsc) 7.734 ns (bulk: 768) 40 cycles(tsc) 10.244 ns (bulk:1024) 72 cycles(tsc) 18.049 ns (bulk:2048) 90 cycles(tsc) 22.585 ns (bulk:4096) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer --- mm/slab.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 62 insertions(+), 25 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index c77ebe6cc87c..21da6b1ccae3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3234,11 +3234,15 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags) #endif /* CONFIG_NUMA */ static __always_inline void * -slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) +slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller, + bool irq_off_needed) { unsigned long save_flags; void *objp; + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + flags &= gfp_allowed_mask; lockdep_trace_alloc(flags); @@ -3249,9 +3253,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) cachep = memcg_kmem_get_cache(cachep, flags); cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); + if (irq_off_needed) + local_irq_save(save_flags); objp = __do_cache_alloc(cachep, flags); - local_irq_restore(save_flags); + if (irq_off_needed) + local_irq_restore(save_flags); objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags, flags); @@ -3407,7 +3413,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp, */ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) { - void *ret = slab_alloc(cachep, flags, _RET_IP_); + void *ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmem_cache_alloc(_RET_IP_, ret, cachep->object_size, cachep->size, flags); @@ -3416,16 +3422,23 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) } EXPORT_SYMBOL(kmem_cache_alloc); -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - __kmem_cache_free_bulk(s, size, p); -} -EXPORT_SYMBOL(kmem_cache_free_bulk); - +/* Note that interrupts must be enabled when calling this function. */ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, - void **p) + void **p) { - return __kmem_cache_alloc_bulk(s, flags, size, p); + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) { + void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false); + + if (!x) { + __kmem_cache_free_bulk(s, i, p); + return false; + } + } + local_irq_enable(); + return true; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); @@ -3435,7 +3448,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size) { void *ret; - ret = slab_alloc(cachep, flags, _RET_IP_); + ret = slab_alloc(cachep, flags, _RET_IP_, true); trace_kmalloc(_RET_IP_, ret, size, cachep->size, flags); @@ -3526,7 +3539,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags, cachep = kmalloc_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(cachep))) return cachep; - ret = slab_alloc(cachep, flags, caller); + ret = slab_alloc(cachep, flags, caller, true); trace_kmalloc(caller, ret, size, cachep->size, flags); @@ -3546,32 +3559,56 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller) } EXPORT_SYMBOL(__kmalloc_track_caller); -/** - * kmem_cache_free - Deallocate an object - * @cachep: The cache the allocation was from. - * @objp: The previously allocated object. - * - * Free an object which was previously allocated from this - * cache. - */ -void kmem_cache_free(struct kmem_cache *cachep, void *objp) +/* Caller is responsible for disabling local IRQs */ +static __always_inline void __kmem_cache_free(struct kmem_cache *cachep, + void *objp, bool irq_off_needed) { unsigned long flags; + + /* Compiler need to remove irq_off_needed branch statements */ + BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed)); + cachep = cache_from_obj(cachep, objp); if (!cachep) return; - local_irq_save(flags); + if (irq_off_needed) + local_irq_save(flags); debug_check_no_locks_freed(objp, cachep->object_size); if (!(cachep->flags & SLAB_DEBUG_OBJECTS)) debug_check_no_obj_freed(objp, cachep->object_size); __cache_free(cachep, objp, _RET_IP_); - local_irq_restore(flags); + if (irq_off_needed) + local_irq_restore(flags); +} +/** + * kmem_cache_free - Deallocate an object + * @cachep: The cache the allocation was from. + * @objp: The previously allocated object. + * + * Free an object which was previously allocated from this + * cache. + */ +void kmem_cache_free(struct kmem_cache *cachep, void *objp) +{ + __kmem_cache_free(cachep, objp, true); trace_kmem_cache_free(_RET_IP_, objp); } EXPORT_SYMBOL(kmem_cache_free); +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + size_t i; + + local_irq_disable(); + for (i = 0; i < size; i++) + __kmem_cache_free(s, p[i], false); + local_irq_enable(); +} +EXPORT_SYMBOL(kmem_cache_free_bulk); + /** * kfree - free previously allocated memory * @objp: pointer returned by kmalloc. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 5/7] slub: support for bulk free with SLUB freelists Date: Mon, 28 Sep 2015 14:26:29 +0200 Message-ID: <20150928122629.15409.69466.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:34461 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757157AbbI1M0b (ORCPT ); Mon, 28 Sep 2015 08:26:31 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: Make it possible to free a freelist with several objects by extending __slab_free() and slab_free() with two arguments: a freelist_head pointer and objects counter (cnt). If freelist_head pointer is set, then the object is the freelist tail pointer. This allows a freelist with several objects (all within the same slub-page) to be free'ed using a single locked cmpxchg_double in __slab_free() and with an unlocked cmpxchg_double in slab_free(). Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slub-page. These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs. Micro benchmarking showed no performance reduction due to this change, when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- mm/slub.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 84 insertions(+), 13 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1cf98d89546d..13b5f53e4840 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) { u8 *p = object; + /* Freepointer not overwritten as SLAB_POISON moved it after object */ if (s->flags & __OBJECT_POISON) { memset(p, POISON_FREE, s->object_size - 1); p[s->object_size - 1] = POISON_END; } + /* + * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then + * freepointer is still safe, as then s->offset equals + * s->inuse and below redzone is after s->object_size and only + * area between s->object_size and s->inuse. + */ if (s->flags & SLAB_RED_ZONE) memset(p + s->object_size, val, s->inuse - s->object_size); } @@ -1063,18 +1070,32 @@ bad: return 0; } +/* Supports checking bulk free of a constructed freelist */ static noinline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *obj_tail, void *freelist_head, int bulk_cnt, unsigned long addr, unsigned long *flags) { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + void *object; + int cnt = 0; spin_lock_irqsave(&n->list_lock, *flags); slab_lock(page); + /* + * Bulk free of a constructed freelist is indicated by the + * freelist_head pointer being set, else obj_tail is object + * being free'ed + */ + object = freelist_head ? : obj_tail; + if (!check_slab(s, page)) goto fail; +next_object: + cnt++; + if (!check_valid_pointer(s, page, object)) { slab_err(s, page, "Invalid object pointer 0x%p", object); goto fail; @@ -1105,8 +1126,19 @@ static noinline struct kmem_cache_node *free_debug_processing( if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); + /* Freepointer not overwritten by init_object(), SLAB_POISON moved it */ init_object(s, object, SLUB_RED_INACTIVE); + + /* Reached end of constructed freelist yet? */ + if (object != obj_tail) { + object = get_freepointer(s, object); + goto next_object; + } out: + if (cnt != bulk_cnt) + slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", + bulk_cnt, cnt); + slab_unlock(page); /* * Keep node_lock to preserve integrity @@ -1210,7 +1242,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s, struct page *page, void *object, unsigned long addr) { return 0; } static inline struct kmem_cache_node *free_debug_processing( - struct kmem_cache *s, struct page *page, void *object, + struct kmem_cache *s, struct page *page, + void *obj_tail, void *freelist_head, int bulk_cnt, unsigned long addr, unsigned long *flags) { return NULL; } static inline int slab_pad_check(struct kmem_cache *s, struct page *page) @@ -1306,6 +1339,35 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x) kasan_slab_free(s, x); } +/* Compiler cannot detect that slab_free_freelist_hook() can be + * removed if slab_free_hook() evaluates to nothing. Thus, we need to + * catch all relevant config debug options here. + */ +#if defined(CONFIG_KMEMCHECK) || \ + defined(CONFIG_LOCKDEP) || \ + defined(CONFIG_DEBUG_KMEMLEAK) || \ + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ + defined(CONFIG_KASAN) +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, + void *freelist_head) +{ + /* + * Bulk free of a constructed freelist is indicated by the + * freelist_head pointer being set, else obj_tail is object + * being free'ed + */ + void *object = freelist_head ? : obj_tail; + + do { + slab_free_hook(s, object); + } while ((object != obj_tail) && + (object = get_freepointer(s, object))); +} +#else +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, + void *freelist_head) {} +#endif + static void setup_object(struct kmem_cache *s, struct page *page, void *object) { @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * So we still attempt to reduce cache line usage. Just take the slab * lock and free the item. If there is no additional partial page * handling required then we can return immediately. + * + * Bulk free of a freelist with several objects (all pointing to the + * same page) possible by specifying freelist_head ptr and object as + * tail ptr, plus objects count (cnt). */ static void __slab_free(struct kmem_cache *s, struct page *page, - void *x, unsigned long addr) + void *x, unsigned long addr, + void *freelist_head, int cnt) { void *prior; void **object = (void *)x; @@ -2595,11 +2662,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page, unsigned long counters; struct kmem_cache_node *n = NULL; unsigned long uninitialized_var(flags); + void *new_freelist = freelist_head ? : x; stat(s, FREE_SLOWPATH); if (kmem_cache_debug(s) && - !(n = free_debug_processing(s, page, x, addr, &flags))) + !(n = free_debug_processing(s, page, x, freelist_head, cnt, + addr, &flags))) return; do { @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, set_freepointer(s, object, prior); new.counters = counters; was_frozen = new.frozen; - new.inuse--; + new.inuse -= cnt; if ((!new.inuse || !prior) && !was_frozen) { if (kmem_cache_has_cpu_partial(s) && !prior) { @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } while (!cmpxchg_double_slab(s, page, prior, counters, - object, new.counters, + new_freelist, new.counters, "__slab_free")); if (likely(!n)) { @@ -2710,13 +2779,15 @@ slab_empty: * with all sorts of special processing. */ static __always_inline void slab_free(struct kmem_cache *s, - struct page *page, void *x, unsigned long addr) + struct page *page, void *x, unsigned long addr, + void *freelist_head, int cnt) { void **object = (void *)x; + void *new_freelist = freelist_head ? : x; struct kmem_cache_cpu *c; unsigned long tid; - slab_free_hook(s, x); + slab_free_freelist_hook(s, x, freelist_head); redo: /* @@ -2740,14 +2811,14 @@ redo: if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, - object, next_tid(tid)))) { + new_freelist, next_tid(tid)))) { note_cmpxchg_failure("slab_free", s, tid); goto redo; } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, x, addr); + __slab_free(s, page, x, addr, freelist_head, cnt); } @@ -2756,7 +2827,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, _RET_IP_); + slab_free(s, virt_to_head_page(x), x, _RET_IP_, NULL, 1); trace_kmem_cache_free(_RET_IP_, x); } EXPORT_SYMBOL(kmem_cache_free); @@ -2791,7 +2862,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) c->tid = next_tid(c->tid); local_irq_enable(); /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_); + __slab_free(s, page, object, _RET_IP_, NULL, 1); local_irq_disable(); c = this_cpu_ptr(s->cpu_slab); } @@ -3531,7 +3602,7 @@ void kfree(const void *x) __free_kmem_pages(page, compound_order(page)); return; } - slab_free(page->slab_cache, page, object, _RET_IP_); + slab_free(page->slab_cache, page, object, _RET_IP_, NULL, 1); } EXPORT_SYMBOL(kfree); From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Date: Mon, 28 Sep 2015 14:26:39 +0200 Message-ID: <20150928122639.15409.21583.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:37653 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757134AbbI1M0l (ORCPT ); Mon, 28 Sep 2015 08:26:41 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: For practical use-cases it is beneficial to prefetch the next freelist object in bulk allocation loop. Micro benchmarking show approx 1 cycle change: bulk - prev-patch - this patch 1 - 49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0 2 - 30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1 3 - 23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2 4 - 20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2 8 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 16 - 17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1 30 - 18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1 32 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 34 - 23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1 48 - 21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1 64 - 20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1 128 - 27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0 158 - 30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0 250 - 37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0 Note, benchmark done with slab_nomerge to keep it stable enough for accurate comparison. Signed-off-by: Jesper Dangaard Brouer --- mm/slub.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index c25717ab3b5a..5af75a618b91 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, goto error; c = this_cpu_ptr(s->cpu_slab); + prefetch_freepointer(s, c->freelist); continue; /* goto for-loop */ } @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, goto error; c->freelist = get_freepointer(s, object); + prefetch_freepointer(s, c->freelist); p[i] = object; /* kmem_cache debug support */ From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Date: Mon, 28 Sep 2015 14:26:34 +0200 Message-ID: <20150928122634.15409.6956.stgit@canyon> References: <20150928122444.15409.10498.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, Jesper Dangaard Brouer , Alexander Duyck , Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim To: linux-mm@kvack.org, Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:34476 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757134AbbI1M0g (ORCPT ); Mon, 28 Sep 2015 08:26:36 -0400 In-Reply-To: <20150928122444.15409.10498.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: This change focus on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk. The calls slab_free (fastpath) and __slab_free (slowpath) have been extended with support for bulk free, which amortize the overhead of the (locked) cmpxchg_double. To use the new bulking feature, we build what I call a detached freelist. The detached freelist takes advantage of three properties: 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free. 2) many freelist's can co-exist side-by-side in the same slab-page each with a separate head pointer. 3) it is the visibility of the head pointer that needs synchronization. Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects, without any synchronization needed. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs. All objects in a SLUB freelist must belong to the same slab-page. Thus, constructing the detached freelist is about matching objects that belong to the same slab-page. The bulk free array is scanned is a progressive manor with a limited look-ahead facility. Kmem debug support is handled in call of slab_free(). Notice kmem_cache_free_bulk no longer need to disable IRQs. This only slowed down single free bulk with approx 3 cycles. Performance data: Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns To get stable and comparable numbers, the kernel have been booted with "slab_merge" (this also improve performance for larger bulk sizes). Performance data, compared against fallback bulking: bulk - fallback bulk - improvement with this patch 1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0% 2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5% 3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6% 4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5% 8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0% 16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3% 30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3% 32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0% 34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0% 48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7% 64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0% 128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0% 158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7% 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4% Performance data, compared current in-kernel bulking: bulk - curr in-kernel - improvement with this patch 1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5% 2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1% 3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5% 4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1% 8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9% 16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6% 30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% 34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5% 48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0% 64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2% 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9% 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0% 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0% Performance with normal SLUB merging is significantly slower for larger bulking. This is believed to (primarily) be an effect of not having to share the per-CPU data-structures, as tuning per-CPU size can achieve similar performance. bulk - slab_nomerge - normal SLUB merge 1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0 2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0 3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0 4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0 8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0 16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0 30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5 32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4 34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1 48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1 64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19 Joint work with Alexander Duyck. [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexander Duyck --- mm/slub.c | 109 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 79 insertions(+), 30 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 13b5f53e4840..c25717ab3b5a 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2832,44 +2832,93 @@ void kmem_cache_free(struct kmem_cache *s, void *x) } EXPORT_SYMBOL(kmem_cache_free); -/* Note that interrupts must be enabled when calling this function. */ -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) -{ - struct kmem_cache_cpu *c; +struct detached_freelist { struct page *page; - int i; + void *tail_object; + void *freelist; + int cnt; +}; - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); +/* + * This function progressively scans the array with free objects (with + * a limited look ahead) and extract objects belonging to the same + * page. It builds a detached freelist directly within the given + * page/objects. This can happen without any need for + * synchronization, because the objects are owned by running process. + * The freelist is build up as a single linked list in the objects. + * The idea is, that this detached freelist can then be bulk + * transferred to the real freelist(s), but only requiring a single + * synchronization primitive. Look ahead in the array is limited due + * to performance reasons. + */ +static int build_detached_freelist(struct kmem_cache *s, size_t size, + void **p, struct detached_freelist *df) +{ + size_t first_skipped_index = 0; + int lookahead = 3; + void *object; - for (i = 0; i < size; i++) { - void *object = p[i]; + /* Always re-init detached_freelist */ + df->page = NULL; - BUG_ON(!object); - /* kmem cache debug support */ - s = cache_from_obj(s, object); - if (unlikely(!s)) - goto exit; - slab_free_hook(s, object); + do { + object = p[--size]; + } while (!object && size); - page = virt_to_head_page(object); + if (!object) + return 0; - if (c->page == page) { - /* Fastpath: local CPU free */ - set_freepointer(s, object, c->freelist); - c->freelist = object; - } else { - c->tid = next_tid(c->tid); - local_irq_enable(); - /* Slowpath: overhead locked cmpxchg_double_slab */ - __slab_free(s, page, object, _RET_IP_, NULL, 1); - local_irq_disable(); - c = this_cpu_ptr(s->cpu_slab); + /* Start new detached freelist */ + set_freepointer(s, object, NULL); + df->page = virt_to_head_page(object); + df->tail_object = object; + df->freelist = object; + p[size] = NULL; /* mark object processed */ + df->cnt = 1; + + while (size) { + object = p[--size]; + if (!object) + continue; /* Skip processed objects */ + + /* df->page is always set at this point */ + if (df->page == virt_to_head_page(object)) { + /* Opportunity build freelist */ + set_freepointer(s, object, df->freelist); + df->freelist = object; + df->cnt++; + p[size] = NULL; /* mark object processed */ + + continue; } + + /* Limit look ahead search */ + if (!--lookahead) + break; + + if (!first_skipped_index) + first_skipped_index = size + 1; } -exit: - c->tid = next_tid(c->tid); - local_irq_enable(); + + return first_skipped_index; +} + + +/* Note that interrupts must be enabled when calling this function. */ +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) +{ + BUG_ON(!size); + + do { + struct detached_freelist df; + + size = build_detached_freelist(s, size, p, &df); + if (unlikely(!df.page)) + continue; + + slab_free(s, df.page, df.tail_object, _RET_IP_, + df.freelist, df.cnt); + } while (likely(size)); } EXPORT_SYMBOL(kmem_cache_free_bulk); From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Date: Mon, 28 Sep 2015 10:22:50 -0500 (CDT) Message-ID: References: <20150928122444.15409.10498.stgit@canyon> <20150928122634.15409.6956.stgit@canyon> Content-Type: text/plain; charset=US-ASCII Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim To: Jesper Dangaard Brouer Return-path: Received: from resqmta-ch2-03v.sys.comcast.net ([69.252.207.35]:37126 "EHLO resqmta-ch2-03v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933905AbbI1PWv (ORCPT ); Mon, 28 Sep 2015 11:22:51 -0400 In-Reply-To: <20150928122634.15409.6956.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: Acked-by: Christoph Lameter From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Date: Mon, 28 Sep 2015 17:51:14 +0200 Message-ID: <20150928175114.07e85114@redhat.com> References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com To: Christoph Lameter Return-path: Received: from mx1.redhat.com ([209.132.183.28]:45301 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934034AbbI1PvT (ORCPT ); Mon, 28 Sep 2015 11:51:19 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 28 Sep 2015 10:16:49 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > diff --git a/mm/slub.c b/mm/slub.c > > index 1cf98d89546d..13b5f53e4840 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) > > { > > u8 *p = object; > > > > + /* Freepointer not overwritten as SLAB_POISON moved it after object */ > > if (s->flags & __OBJECT_POISON) { > > memset(p, POISON_FREE, s->object_size - 1); > > p[s->object_size - 1] = POISON_END; > > } > > > > + /* > > + * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then > > + * freepointer is still safe, as then s->offset equals > > + * s->inuse and below redzone is after s->object_size and only > > + * area between s->object_size and s->inuse. > > + */ > > if (s->flags & SLAB_RED_ZONE) > > memset(p + s->object_size, val, s->inuse - s->object_size); > > } > > Are these comments really adding something? This is basic metadata > handling for SLUB that is commented on elsehwere. Not knowing SLUB as well as you, it took me several hours to realize init_object() didn't overwrite the freepointer in the object. Thus, I think these comments make the reader aware of not-so-obvious side-effects of SLAB_POISON and SLAB_RED_ZONE. > > @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); > > * So we still attempt to reduce cache line usage. Just take the slab > > * lock and free the item. If there is no additional partial page > > * handling required then we can return immediately. > > + * > > + * Bulk free of a freelist with several objects (all pointing to the > > + * same page) possible by specifying freelist_head ptr and object as > > + * tail ptr, plus objects count (cnt). > > */ > > static void __slab_free(struct kmem_cache *s, struct page *page, > > - void *x, unsigned long addr) > > + void *x, unsigned long addr, > > + void *freelist_head, int cnt) > > Do you really need separate parameters for freelist_head? If you just want > to deal with one object pass it as freelist_head and set cnt = 1? Yes, I need it. We need to know both the head and tail of the list to splice it. See: > @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, prior = page->freelist; counters = page->counters; > set_freepointer(s, object, prior); ^^^^^^ Here we update the tail ptr (object) to point to "prior" (page->freelist). > new.counters = counters; > was_frozen = new.frozen; > - new.inuse--; > + new.inuse -= cnt; > if ((!new.inuse || !prior) && !was_frozen) { > > if (kmem_cache_has_cpu_partial(s) && !prior) { > @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > > } while (!cmpxchg_double_slab(s, page, > prior, counters, > - object, new.counters, > + new_freelist, new.counters, > "__slab_free")); Here we update page->freelist ("prior") to point to the head. Thus, splicing the list. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Date: Mon, 28 Sep 2015 17:59:01 +0200 Message-ID: <20150928175901.39976cdb@redhat.com> References: <20150928122444.15409.10498.stgit@canyon> <20150928122639.15409.21583.stgit@canyon> <5609545C.4010807@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Christoph Lameter , Joonsoo Kim , brouer@redhat.com To: Alexander Duyck Return-path: Received: from mx1.redhat.com ([209.132.183.28]:51909 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934252AbbI1P7H (ORCPT ); Mon, 28 Sep 2015 11:59:07 -0400 In-Reply-To: <5609545C.4010807@gmail.com> Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 28 Sep 2015 07:53:16 -0700 Alexander Duyck wrote: > On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote: > > For practical use-cases it is beneficial to prefetch the next freelist > > object in bulk allocation loop. > > > > Micro benchmarking show approx 1 cycle change: > > > > bulk - prev-patch - this patch > > 1 - 49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0 > > 2 - 30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1 > > 3 - 23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2 > > 4 - 20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2 > > 8 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > > 16 - 17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1 > > 30 - 18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1 > > 32 - 18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1 > > 34 - 23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1 > > 48 - 21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1 > > 64 - 20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1 > > 128 - 27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0 > > 158 - 30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0 > > 250 - 37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0 > > > > Note, benchmark done with slab_nomerge to keep it stable enough > > for accurate comparison. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > mm/slub.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/mm/slub.c b/mm/slub.c > > index c25717ab3b5a..5af75a618b91 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > > goto error; > > > > c = this_cpu_ptr(s->cpu_slab); > > + prefetch_freepointer(s, c->freelist); > > continue; /* goto for-loop */ > > } > > > > @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > > goto error; > > > > c->freelist = get_freepointer(s, object); > > + prefetch_freepointer(s, c->freelist); > > p[i] = object; > > > > /* kmem_cache debug support */ > > > > I can see the prefetch in the last item case being possibly useful since > you have time between when you call the prefetch and when you are > accessing the next object. However, is there any actual benefit to > prefetching inside the loop itself? Based on your data above it doesn't > seem like that is the case since you are now adding one additional cycle > to the allocation and I am not seeing any actual gain reported here. The gain will first show up, when using bulk alloc in real use-cases. As you know, bulk alloc on RX path don't show any improvement. And I measured (with perf-mem-record) L1 miss'es here. I could reduce the L1 misses here by adding prefetch. But I cannot remember if I measured any PPS improvement with this. As you hint, the time I have between my prefetch and use is very small, thus the question is if this will show any benefit for real use-cases. We can drop this patch, and then I'll include it in my network use-case, and measure the effect? (Although I'll likely be wasting my time, as we should likely redesign the alloc API instead). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Date: Tue, 29 Sep 2015 09:12:19 +0200 Message-ID: <20150929091219.40d1e217@redhat.com> References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com To: Christoph Lameter Return-path: Received: from mx1.redhat.com ([209.132.183.28]:49383 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932622AbbI2HMZ (ORCPT ); Tue, 29 Sep 2015 03:12:25 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 28 Sep 2015 11:30:00 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > Not knowing SLUB as well as you, it took me several hours to realize > > init_object() didn't overwrite the freepointer in the object. Thus, I > > think these comments make the reader aware of not-so-obvious > > side-effects of SLAB_POISON and SLAB_RED_ZONE. > > From the source: > > /* > * Object layout: > * > * object address > * Bytes of the object to be managed. > * If the freepointer may overlay the object then the free > * pointer is the first word of the object. > * > * Poisoning uses 0x6b (POISON_FREE) and the last byte is > * 0xa5 (POISON_END) > * > * object + s->object_size > * Padding to reach word boundary. This is also used for Redzoning. > * Padding is extended by another word if Redzoning is enabled and > * object_size == inuse. > * > * We fill with 0xbb (RED_INACTIVE) for inactive objects and with > * 0xcc (RED_ACTIVE) for objects in use. > * > * object + s->inuse > * Meta data starts here. > * > * A. Free pointer (if we cannot overwrite object on free) > * B. Tracking data for SLAB_STORE_USER > * C. Padding to reach required alignment boundary or at mininum > * one word if debugging is on to be able to detect writes > * before the word boundary. Okay, I will remove the comment. The best doc on SLUB and SLAB layout comes from your slides titled: "Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB" Lets gracefully add a link to the slides here: http://events.linuxfoundation.org/sites/events/files/slides/slaballocators.pdf -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists Date: Tue, 29 Sep 2015 09:32:24 +0200 Message-ID: <20150929093224.65bed249@redhat.com> References: <20150928122444.15409.10498.stgit@canyon> <20150928122629.15409.69466.stgit@canyon> <20150928175114.07e85114@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Andrew Morton , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes , Joonsoo Kim , brouer@redhat.com To: Christoph Lameter Return-path: Received: from mx1.redhat.com ([209.132.183.28]:35736 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755371AbbI2Hc3 (ORCPT ); Tue, 29 Sep 2015 03:32:29 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 28 Sep 2015 11:28:15 -0500 (CDT) Christoph Lameter wrote: > On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote: > > > > Do you really need separate parameters for freelist_head? If you just want > > > to deal with one object pass it as freelist_head and set cnt = 1? > > > > Yes, I need it. We need to know both the head and tail of the list to > > splice it. > > Ok so this is to avoid having to scan the list to its end? True. > x is the end > of the list and freelist_head the beginning. That is weird. Yes, it is a bit weird... the bulk free of freelists comes out as a second-class citizen. Okay, I'll try to change the slab_free() and __slab_free() calls to have a "head" and "tail". And let tail be NULL on single object free, to allow compiler to do constant propagation (thus keeping existing fastpath unaffected). (The same code should be generated, but we will have a more intuitive API). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexander Duyck Subject: Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Date: Tue, 29 Sep 2015 10:20:20 -0700 Message-ID: <560AC854.6040601@gmail.com> References: <20150929154605.14465.98995.stgit@canyon> <20150929154807.14465.76422.stgit@canyon> <560ABE86.9050508@gmail.com> <20150929190029.01ca01f2@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Pekka Enberg , David Rientjes , Joonsoo Kim To: Jesper Dangaard Brouer Return-path: Received: from mail-pa0-f41.google.com ([209.85.220.41]:35094 "EHLO mail-pa0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934840AbbI2RUW (ORCPT ); Tue, 29 Sep 2015 13:20:22 -0400 Received: by pacfv12 with SMTP id fv12so11964300pac.2 for ; Tue, 29 Sep 2015 10:20:22 -0700 (PDT) In-Reply-To: <20150929190029.01ca01f2@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On 09/29/2015 10:00 AM, Jesper Dangaard Brouer wrote: > On Tue, 29 Sep 2015 09:38:30 -0700 > Alexander Duyck wrote: > >> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote: >>> +#if defined(CONFIG_KMEMCHECK) || \ >>> + defined(CONFIG_LOCKDEP) || \ >>> + defined(CONFIG_DEBUG_KMEMLEAK) || \ >>> + defined(CONFIG_DEBUG_OBJECTS_FREE) || \ >>> + defined(CONFIG_KASAN) >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, >>> + void *head, void *tail) >>> +{ >>> + void *object = head; >>> + void *tail_obj = tail ? : head; >>> + >>> + do { >>> + slab_free_hook(s, object); >>> + } while ((object != tail_obj) && >>> + (object = get_freepointer(s, object))); >>> +} >>> +#else >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail, >>> + void *freelist_head) {} >>> +#endif >>> + >> Instead of messing around with an #else you might just wrap the contents >> of slab_free_freelist_hook in the #if/#endif instead of the entire >> function declaration. > I had it that way in an earlier version of the patch, but I liked > better this way. It would be nice if the argument names were the same for both cases. Having the names differ will make it more difficult to maintain when changes need to be made to the function. - Alex From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Thu, 1 Oct 2015 15:10:15 -0700 Message-ID: <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes To: Jesper Dangaard Brouer Return-path: Received: from mail.linuxfoundation.org ([140.211.169.12]:43804 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750770AbbJAWKR (ORCPT ); Thu, 1 Oct 2015 18:10:17 -0400 In-Reply-To: <20150930114255.13505.2618.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer wrote: > Make it possible to free a freelist with several objects by adjusting > API of slab_free() and __slab_free() to have head, tail and an objects > counter (cnt). > > Tail being NULL indicate single object free of head object. This > allow compiler inline constant propagation in slab_free() and > slab_free_freelist_hook() to avoid adding any overhead in case of > single object free. > > This allows a freelist with several objects (all within the same > slab-page) to be free'ed using a single locked cmpxchg_double in > __slab_free() and with an unlocked cmpxchg_double in slab_free(). > > Object debugging on the free path is also extended to handle these > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if > objects don't belong to the same slab-page. > > These changes are needed for the next patch to bulk free the detached > freelists it introduces and constructs. > > Micro benchmarking showed no performance reduction due to this change, > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG). > checkpatch says WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON() #205: FILE: mm/slub.c:2888: + BUG_ON(!size); Linus will get mad at you if he finds out, and we wouldn't want that. --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix +++ a/mm/slub.c @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc /* Note that interrupts must be enabled when calling this function. */ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) { - BUG_ON(!size); + if (WARN_ON(!size)) + return; do { struct detached_freelist df; _ From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Fri, 2 Oct 2015 14:50:44 -0700 Message-ID: <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Hannes Frederic Sowa To: Jesper Dangaard Brouer Return-path: Received: from mail.linuxfoundation.org ([140.211.169.12]:57082 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751295AbbJBVuq (ORCPT ); Fri, 2 Oct 2015 17:50:46 -0400 In-Reply-To: <20151002154039.69f82bdc@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer wrote: > > Thus, I need introducing new code like this patch and at the same time > > have to reduce the number of instruction-cache misses/usage. In this > > case we solve the problem by kmem_cache_free_bulk() not getting called > > too often. Thus, +17 bytes will hopefully not matter too much... but on > > the other hand we sort-of know that calling kmem_cache_free_bulk() will > > cause icache misses. > > I just tested this change on top of my net-use-case patchset... and for > some strange reason the code with this WARN_ON is faster and have much > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses). > > Thus, I think we should keep your fix. > > I cannot explain why using WARN_ON() is better and cause less icache > misses. And I hate when I don't understand every detail. > > My theory is, after reading the assembler code, that the UD2 > instruction (from BUG_ON) cause some kind of icache decoder stall > (Intel experts???). Now that should not be a problem, as UD2 is > obviously placed as an unlikely branch and left at the end of the asm > function call. But the call to __slab_free() is also placed at the end > of the asm function (gets inlined from slab_free() as unlikely). And > it is actually fairly likely that bulking is calling __slab_free (slub > slowpath call). Yes, I was looking at the asm code and the difference is pretty small: a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly. But I wouldn't assume that the microbenchmarking is meaningful. I've seen shockingly large (and quite repeatable) microbenchmarking differences from small changes in code which isn't even executed (and this is one such case, actually). You add or remove just one byte of text and half the kernel (or half the .o file?) gets a different alignment and this seems to change everything. Deleting the BUG altogether sounds the best solution. As long as the kernel crashes in some manner, we'll be able to work out what happened. And it's cant-happen anyway, isn't it? From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Mon, 5 Oct 2015 14:20:45 -0700 Message-ID: <20151005212045.GG26924@tassilo.jf.intel.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Andrew Morton , linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , Pekka Enberg , netdev@vger.kernel.org, Joonsoo Kim , David Rientjes , Hannes Frederic Sowa , Arnaldo Carvalho de Melo To: Jesper Dangaard Brouer Return-path: Received: from mga01.intel.com ([192.55.52.88]:23569 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751639AbbJEVUq (ORCPT ); Mon, 5 Oct 2015 17:20:46 -0400 Content-Disposition: inline In-Reply-To: <20151005212639.35932b6c@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: > My only problem left, is I want a perf measurement that pinpoint these > kind of spots. The difference in L1-icache-load-misses were significant > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > different perf events without being able to pinpoint the location (even > though I know the spot now). Even tried Andi's ocperf.py... maybe he > will know what event I should try? Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the bottle neck is and what to sample for if there is a suitable event and even prints the command line. https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev However frontend issues are difficult to sample, as they happen very far away from instruction retirement where the sampling happens. So you may have large skid and the sampling points may be far away. Skylake has new special FRONTEND_* PEBS events for this, but before it was often difficult. BTW if your main goal is icache; I wrote a gcc patch to help the kernel by enabling function splitting: Apply the patch in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5, make sure 9bebe9e5b0f (now in mainline) is applied and build with -freorder-blocks-and-partition. That will split all functions into statically predicted hot and cold parts and generally relieves icache pressure. Any testing of this on your workload welcome. -Andi From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Tue, 6 Oct 2015 01:53:37 +0200 Message-ID: <20151006015337.0fab1547@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: linux-mm@kvack.org, Christoph Lameter , netdev@vger.kernel.org, Andi Kleen , Arnaldo Carvalho de Melo , brouer@redhat.com To: unlisted-recipients:; (no To-header on input) Return-path: Received: from mx1.redhat.com ([209.132.183.28]:42297 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751010AbbJEXxl convert rfc822-to-8bit (ORCPT ); Mon, 5 Oct 2015 19:53:41 -0400 In-Reply-To: <20151005212639.35932b6c@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer wrote: > My only problem left, is I want a perf measurement that pinpoint thes= e > kind of spots. The difference in L1-icache-load-misses were signific= ant > (1,278,276 vs 2,719,158). I tried to somehow perf record this with > different perf events without being able to pinpoint the location (ev= en > though I know the spot now). Even tried Andi's ocperf.py... maybe he > will know what event I should try? Using: 'ocperf.py -e icache_misses' and looking closer at the perf annotate and considering "skid" I think I can see the icache misses happening in the end of the function, due to the UD2 inst. Annotation of kmem_cache_free_bulk (last/end of func): =E2=94=8217b: test %r12,%r12 =E2=94=82 =E2=86=91 jne 2e =E2=94=82184: pop %rbx =E2=94=82 pop %r12 =E2=94=82 pop %r13 =E2=94=82 pop %r14 =E2=94=82 pop %r15 =E2=94=82 pop %rbp =E2=94=82 =E2=86=90 retq 8.57 =E2=94=8218f: mov 0x30(%rdx),%rdx 5.71 =E2=94=82 =E2=86=91 jmp 116 =E2=94=82195: ud2 2.86 =E2=94=82197: mov %rdi,%rsi =E2=94=82 mov %r11d,%r8d =E2=94=82 mov %r10,%rcx =E2=94=82 mov %rbx,%rdx =E2=94=82 mov %r15,%rdi =E2=94=82 =E2=86=92 callq __slab_free =E2=94=82 =E2=86=91 jmp 17b 2.86 =E2=94=821ad: mov 0x30(%rdi),%rdi =E2=94=82 =E2=86=91 jmpq 99 --=20 Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Wed, 7 Oct 2015 12:39:19 +0200 Message-ID: <20151007123919.27a9c823@redhat.com> References: <560ABE86.9050508@gmail.com> <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: linux-mm@kvack.org, Christoph Lameter , Alexander Duyck , netdev@vger.kernel.org, Arnaldo Carvalho de Melo , brouer@redhat.com To: Andrew Morton Return-path: Received: from mx1.redhat.com ([209.132.183.28]:42471 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750916AbbJGKjX (ORCPT ); Wed, 7 Oct 2015 06:39:23 -0400 In-Reply-To: <20151005212639.35932b6c@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer wrote: > On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton wrote: > [...] > > > Deleting the BUG altogether sounds the best solution. As long as the > > kernel crashes in some manner, we'll be able to work out what happened. > > And it's cant-happen anyway, isn't it? > > To me WARN_ON() seems like a good "documentation" if it does not hurt > performance. I don't think removing the WARN_ON() will improve > performance, but I'm willing to actually test if it matters. I tested removing BUG/WARN_ON altogether, and it gives slightly worse performance. The icache-misses only increase approx 14% (not 112% as before). This, I'm willing to attribute to some code alignment issue. Thus, let us just keep the WARN_ON() and move along. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists Date: Wed, 7 Oct 2015 08:44:56 -0700 Message-ID: <20151007154456.GJ26924@tassilo.jf.intel.com> References: <20150930114255.13505.2618.stgit@canyon> <20151001151015.c59a1360c7720a257f655578@linux-foundation.org> <20151002114118.75aae2f9@redhat.com> <20151002154039.69f82bdc@redhat.com> <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org> <20151005212639.35932b6c@redhat.com> <20151005212045.GG26924@tassilo.jf.intel.com> <20151006010703.09e2f0ff@redhat.com> <20151007143120.7068416d@redhat.com> <20151007133619.GR12682@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jesper Dangaard Brouer , linux-mm@kvack.org, netdev@vger.kernel.org To: Arnaldo Carvalho de Melo Return-path: Received: from mga03.intel.com ([134.134.136.65]:13805 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753954AbbJGPpD (ORCPT ); Wed, 7 Oct 2015 11:45:03 -0400 Content-Disposition: inline In-Reply-To: <20151007133619.GR12682@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: > There is a recent patch that may help here, see below, but maybe its > just a matter of removing that :pp, as it ends with a /pp anyway, no > need to state that twice :) Yes the extra :pp was a regression in toplev. I fixed it now. Thanks. -Andi From mboxrd@z Thu Jan 1 00:00:00 1970 From: Joonsoo Kim Subject: Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Date: Wed, 14 Oct 2015 14:15:25 +0900 Message-ID: <20151014051524.GA29286@js1304-P5Q-DELUXE> References: <20150929154605.14465.98995.stgit@canyon> <20150929154822.14465.50207.stgit@canyon> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes To: Jesper Dangaard Brouer Return-path: Received: from LGEAMRELO13.lge.com ([156.147.23.53]:47690 "EHLO lgeamrelo13.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750807AbbJNFOk (ORCPT ); Wed, 14 Oct 2015 01:14:40 -0400 Content-Disposition: inline In-Reply-To: <20150929154822.14465.50207.stgit@canyon> Sender: netdev-owner@vger.kernel.org List-ID: On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote: > This change focus on improving the speed of object freeing in the > "slowpath" of kmem_cache_free_bulk. > > The calls slab_free (fastpath) and __slab_free (slowpath) have been > extended with support for bulk free, which amortize the overhead of > the (locked) cmpxchg_double. > > To use the new bulking feature, we build what I call a detached > freelist. The detached freelist takes advantage of three properties: > > 1) the free function call owns the object that is about to be freed, > thus writing into this memory is synchronization-free. > > 2) many freelist's can co-exist side-by-side in the same slab-page > each with a separate head pointer. > > 3) it is the visibility of the head pointer that needs synchronization. > > Given these properties, the brilliant part is that the detached > freelist can be constructed without any need for synchronization. The > freelist is constructed directly in the page objects, without any > synchronization needed. The detached freelist is allocated on the > stack of the function call kmem_cache_free_bulk. Thus, the freelist > head pointer is not visible to other CPUs. > > All objects in a SLUB freelist must belong to the same slab-page. > Thus, constructing the detached freelist is about matching objects > that belong to the same slab-page. The bulk free array is scanned is > a progressive manor with a limited look-ahead facility. > > Kmem debug support is handled in call of slab_free(). > > Notice kmem_cache_free_bulk no longer need to disable IRQs. This > only slowed down single free bulk with approx 3 cycles. > > > Performance data: > Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz > > SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns > > To get stable and comparable numbers, the kernel have been booted with > "slab_merge" (this also improve performance for larger bulk sizes). > > Performance data, compared against fallback bulking: > > bulk - fallback bulk - improvement with this patch > 1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0% > 2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5% > 3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6% > 4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5% > 8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0% > 16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3% > 30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3% > 32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0% > 34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0% > 48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7% > 64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0% > 128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0% > 158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7% > 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4% > > Performance data, compared current in-kernel bulking: > > bulk - curr in-kernel - improvement with this patch > 1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5% > 2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1% > 3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5% > 4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1% > 8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9% > 16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6% > 30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% > 32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0% > 34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5% > 48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0% > 64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2% > 128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9% > 158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0% > 250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0% > > Performance with normal SLUB merging is significantly slower for > larger bulking. This is believed to (primarily) be an effect of not > having to share the per-CPU data-structures, as tuning per-CPU size > can achieve similar performance. > > bulk - slab_nomerge - normal SLUB merge > 1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0 > 2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0 > 3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0 > 4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0 > 8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0 > 16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0 > 30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5 > 32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4 > 34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1 > 48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1 > 64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28 > 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30 > 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29 > 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19 > > Joint work with Alexander Duyck. > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c > > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Alexander Duyck > Acked-by: Christoph Lameter > --- > mm/slub.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 78 insertions(+), 30 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 7c2abc33fd4e..53500f3b70ab 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2812,44 +2812,92 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > } > EXPORT_SYMBOL(kmem_cache_free); > > -/* Note that interrupts must be enabled when calling this function. */ > -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) > -{ > - struct kmem_cache_cpu *c; > +struct detached_freelist { > struct page *page; > - int i; > + void *tail; > + void *freelist; > + int cnt; > +}; > > - local_irq_disable(); > - c = this_cpu_ptr(s->cpu_slab); > +/* > + * This function progressively scans the array with free objects (with > + * a limited look ahead) and extract objects belonging to the same > + * page. It builds a detached freelist directly within the given > + * page/objects. This can happen without any need for > + * synchronization, because the objects are owned by running process. > + * The freelist is build up as a single linked list in the objects. > + * The idea is, that this detached freelist can then be bulk > + * transferred to the real freelist(s), but only requiring a single > + * synchronization primitive. Look ahead in the array is limited due > + * to performance reasons. > + */ > +static int build_detached_freelist(struct kmem_cache *s, size_t size, > + void **p, struct detached_freelist *df) > +{ > + size_t first_skipped_index = 0; > + int lookahead = 3; > + void *object; > > - for (i = 0; i < size; i++) { > - void *object = p[i]; > + /* Always re-init detached_freelist */ > + df->page = NULL; > > - BUG_ON(!object); > - /* kmem cache debug support */ > - s = cache_from_obj(s, object); > - if (unlikely(!s)) > - goto exit; > - slab_free_hook(s, object); > + do { > + object = p[--size]; > + } while (!object && size); > > - page = virt_to_head_page(object); > + if (!object) > + return 0; > > - if (c->page == page) { > - /* Fastpath: local CPU free */ > - set_freepointer(s, object, c->freelist); > - c->freelist = object; > - } else { > - c->tid = next_tid(c->tid); > - local_irq_enable(); > - /* Slowpath: overhead locked cmpxchg_double_slab */ > - __slab_free(s, page, object, object, 1, _RET_IP_); > - local_irq_disable(); > - c = this_cpu_ptr(s->cpu_slab); > + /* Start new detached freelist */ > + set_freepointer(s, object, NULL); > + df->page = virt_to_head_page(object); > + df->tail = object; > + df->freelist = object; > + p[size] = NULL; /* mark object processed */ > + df->cnt = 1; Hello, Jesper. AFAIK, it is uncommon to clear pointer to object in argument array. At least, it is better to comment it on somewhere. Or, how about removing lookahead facility? Does it have real benefit? Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Joonsoo Kim Subject: Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Date: Thu, 5 Nov 2015 14:09:25 +0900 Message-ID: <20151105050924.GD20374@js1304-P5Q-DELUXE> References: <20150929154605.14465.98995.stgit@canyon> <20150929154822.14465.50207.stgit@canyon> <20151014051524.GA29286@js1304-P5Q-DELUXE> <20151021095709.167e58d2@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-mm@kvack.org, Andrew Morton , Christoph Lameter , netdev@vger.kernel.org, Alexander Duyck , Pekka Enberg , David Rientjes To: Jesper Dangaard Brouer Return-path: Received: from LGEAMRELO13.lge.com ([156.147.23.53]:38706 "EHLO lgeamrelo13.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750766AbbKEFJP (ORCPT ); Thu, 5 Nov 2015 00:09:15 -0500 Content-Disposition: inline In-Reply-To: <20151021095709.167e58d2@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, Oct 21, 2015 at 09:57:09AM +0200, Jesper Dangaard Brouer wrote: > On Wed, 14 Oct 2015 14:15:25 +0900 > Joonsoo Kim wrote: > > > On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote: > > > This change focus on improving the speed of object freeing in the > > > "slowpath" of kmem_cache_free_bulk. > > > > > > The calls slab_free (fastpath) and __slab_free (slowpath) have been > > > extended with support for bulk free, which amortize the overhead of > > > the (locked) cmpxchg_double. > > > > > > To use the new bulking feature, we build what I call a detached > > > freelist. The detached freelist takes advantage of three properties: > > > > > > 1) the free function call owns the object that is about to be freed, > > > thus writing into this memory is synchronization-free. > > > > > > 2) many freelist's can co-exist side-by-side in the same slab-page > > > each with a separate head pointer. > > > > > > 3) it is the visibility of the head pointer that needs synchronization. > > > > > > Given these properties, the brilliant part is that the detached > > > freelist can be constructed without any need for synchronization. The > > > freelist is constructed directly in the page objects, without any > > > synchronization needed. The detached freelist is allocated on the > > > stack of the function call kmem_cache_free_bulk. Thus, the freelist > > > head pointer is not visible to other CPUs. > > > > > > All objects in a SLUB freelist must belong to the same slab-page. > > > Thus, constructing the detached freelist is about matching objects > > > that belong to the same slab-page. The bulk free array is scanned is > > > a progressive manor with a limited look-ahead facility. > [...] > > > > Hello, Jesper. > > > > AFAIK, it is uncommon to clear pointer to object in argument array. > > At least, it is better to comment it on somewhere. > > In this case, I think clearing the array is a good thing, as > using/referencing objects after they have been free'ed is a bug (which > can be hard to detect). Okay. > > > Or, how about removing lookahead facility? Does it have real benefit? > > In my earlier patch series I had a version with and without lookahead > facility. Just so I could benchmark the difference. With Alex'es help > we/I tuned the code with the lookahead feature to be just as fast. > Thus, I merged the two patches. (Also did testing for worstcase [1]) > > I do wonder if the lookahead have any real benefit. In micro > benchmarking it might be "just-as-fast", but I do suspect (just the code > size increase) it can affect real use-cases... Should we remove it? > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c If it's not implemented yet, I would say that starting with simple one first. But, now, we already have well implemented one so we don't need to remove it. :) Thanks.