* [PATCH V2 0/6] slub: bulk alloc and free for slub allocator
@ 2015-06-17 14:26 Jesper Dangaard Brouer
  2015-06-17 14:27 ` [PATCH V2 1/6] slub: fix spelling succedd to succeed Jesper Dangaard Brouer
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:26 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
With this patchset the SLUB allocator now has both bulk alloc and bulk
free implemented.
(This patchset is based on DaveM's net-next tree on-top of commit
89d256bb69f)
This patchset mostly optimizes the "fastpath" where objects are
available on the per CPU fastpath page.  This mostly amortizes the
less-heavy non-locked cmpxchg_double used on the fastpath.
The "fallback" bulking (e.g. __kmem_cache_free_bulk) provides a good
basis for comparison.  For the measurements[1], the fallback functions
__kmem_cache_{free,alloc}_bulk were copied from slab_common.c and
forced "noinline" to force a function call, as in slab_common.c.
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
Measurements last-patch with disabled debugging:
Bulk- fallback                   - this-patch
  1 -  57 cycles(tsc) 14.448 ns  -  44 cycles(tsc) 11.236 ns  improved 22.8%
  2 -  51 cycles(tsc) 12.768 ns  -  28 cycles(tsc)  7.019 ns  improved 45.1%
  3 -  48 cycles(tsc) 12.232 ns  -  22 cycles(tsc)  5.526 ns  improved 54.2%
  4 -  48 cycles(tsc) 12.025 ns  -  19 cycles(tsc)  4.786 ns  improved 60.4%
  8 -  46 cycles(tsc) 11.558 ns  -  18 cycles(tsc)  4.572 ns  improved 60.9%
 16 -  45 cycles(tsc) 11.458 ns  -  18 cycles(tsc)  4.658 ns  improved 60.0%
 30 -  45 cycles(tsc) 11.499 ns  -  18 cycles(tsc)  4.568 ns  improved 60.0%
 32 -  79 cycles(tsc) 19.917 ns  -  65 cycles(tsc) 16.454 ns  improved 17.7%
 34 -  78 cycles(tsc) 19.655 ns  -  63 cycles(tsc) 15.932 ns  improved 19.2%
 48 -  68 cycles(tsc) 17.049 ns  -  50 cycles(tsc) 12.506 ns  improved 26.5%
 64 -  80 cycles(tsc) 20.009 ns  -  63 cycles(tsc) 15.929 ns  improved 21.3%
128 -  94 cycles(tsc) 23.749 ns  -  86 cycles(tsc) 21.583 ns  improved  8.5%
158 -  97 cycles(tsc) 24.299 ns  -  90 cycles(tsc) 22.552 ns  improved  7.2%
250 - 102 cycles(tsc) 25.681 ns  -  98 cycles(tsc) 24.589 ns  improved  3.9%
Benchmarking shows impressive improvements in the "fastpath" with a
small number of objects in the working set.  Once the working set
increases, resulting in activating the "slowpath" (that contains the
heavier locked cmpxchg_double) the improvement decreases.
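For readers checking the tables: the "improved" column appears to be the
relative reduction in cycles per object, computed from the rounded cycle
counts.  That reading is an assumption, and the helper name below is
hypothetical, but it reproduces the rows above (e.g. 57 -> 44 cycles
gives 22.8%).

```c
#include <assert.h>

/*
 * Relative improvement of the patched path over the fallback path, in
 * percent.  Positive means fewer cycles per object; negative means a
 * regression.  (Hypothetical helper, not part of the patchset.)
 */
static double improvement_pct(unsigned int fallback_cycles,
			      unsigned int patched_cycles)
{
	assert(fallback_cycles > 0);
	return 100.0 * ((double)fallback_cycles - (double)patched_cycles) /
	       (double)fallback_cycles;
}
```

With this definition, improvement_pct(102, 98) is roughly 3.9, matching
the bulk-250 row, which shows how little is left once the slowpath
dominates.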
I'm currently working on also optimizing the "slowpath" (as the network
stack use-case hits this), but this patchset should provide a good
foundation for further improvements.
The rest of my patch queue in this area needs some more work, but
preliminary results are good.  I'm attending Netfilter Workshop[2]
next week, and I'll hopefully return to working on further improvements
in this area.
[1] https://github.com/netoptimizer/prototype-kernel/blob/b4688559b/kernel/mm/slab_bulk_test01.c#L80
[2] http://workshop.netfilter.org/2015/
---
Christoph Lameter (1):
      slab: infrastructure for bulk object allocation and freeing
Jesper Dangaard Brouer (5):
      slub: fix spelling succedd to succeed
      slub bulk alloc: extract objects from the per cpu slab
      slub: improve bulk alloc strategy
      slub: initial bulk free implementation
      slub: add support for kmem_cache_debug in bulk calls
 include/linux/slab.h |   10 +++++
 mm/slab.c            |   13 ++++++
 mm/slab.h            |    9 ++++
 mm/slab_common.c     |   23 +++++++++++
 mm/slob.c            |   13 ++++++
 mm/slub.c            |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 176 insertions(+), 1 deletion(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH V2 1/6] slub: fix spelling succedd to succeed
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
@ 2015-06-17 14:27 ` Jesper Dangaard Brouer
  2015-06-17 14:27 ` [PATCH V2 2/6] slab: infrastructure for bulk object allocation and freeing Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:27 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 54c0876b43d5..41624ccabc63 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2712,7 +2712,7 @@ redo:
 	 * Determine the currently cpus per cpu slab.
 	 * The cpu may change afterward. However that does not matter since
 	 * data is retrieved via this pointer. If we are on the same cpu
-	 * during the cmpxchg then the free will succedd.
+	 * during the cmpxchg then the free will succeed.
 	 */
 	do {
 		tid = this_cpu_read(s->cpu_slab->tid);
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH V2 2/6] slab: infrastructure for bulk object allocation and freeing
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
  2015-06-17 14:27 ` [PATCH V2 1/6] slub: fix spelling succedd to succeed Jesper Dangaard Brouer
@ 2015-06-17 14:27 ` Jesper Dangaard Brouer
  2015-06-17 14:28 ` [PATCH V2 3/6] slub bulk alloc: extract objects from the per cpu slab Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:27 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
From: Christoph Lameter <cl@linux.com>
Add the basic infrastructure for alloc/free operations on pointer arrays.
It includes a generic function in the common slab code that is used in
this infrastructure patch to create the unoptimized functionality for slab
bulk operations.
Allocators can then provide optimized allocation functions for situations
in which large numbers of objects are needed.  These optimizations may
avoid taking locks repeatedly and bypass metadata creation if all objects
in slab pages can be used to provide the objects required.
Allocators can extend the skeletons provided and add their own code to the
bulk alloc and free functions.  They can keep the generic allocation and
freeing and just fall back to those if optimizations would not work (like
for example when debugging is on).
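As a rough userspace illustration of the generic behaviour described
above (allocate one object at a time, and undo everything on failure),
here is a minimal sketch.  malloc()/free() stand in for
kmem_cache_alloc()/kmem_cache_free(), and the names obj_alloc, obj_free,
bulk_alloc, bulk_free, alloc_budget and live_objects are all
hypothetical; alloc_budget exists only to inject a failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static size_t live_objects;             /* objects currently allocated */
static size_t alloc_budget = SIZE_MAX;  /* allocations allowed before failing */

/* Stand-in for kmem_cache_alloc(): fails once the budget is used up */
static void *obj_alloc(size_t size)
{
	void *x;

	if (alloc_budget == 0)
		return NULL;
	x = malloc(size);
	if (!x)
		return NULL;
	alloc_budget--;
	live_objects++;
	return x;
}

/* Stand-in for kmem_cache_free() */
static void obj_free(void *x)
{
	assert(x != NULL);
	free(x);
	live_objects--;
}

static void bulk_free(size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++)
		obj_free(p[i]);
}

/* All-or-nothing: on failure, the i objects already obtained are freed */
static bool bulk_alloc(size_t size, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		void *x = p[i] = obj_alloc(size);

		if (!x) {
			bulk_free(i, p);
			return false;
		}
	}
	return true;
}
```

The all-or-nothing contract means a caller never has to work out which
entries of the pointer array are valid after a failed call.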
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
V2: fix kmem_cache_alloc_bulk calling itself
In measurements[1] the fallback functions __kmem_cache_{free,alloc}_bulk
have been copied from slab_common.c and forced "noinline" to force a
function call like slab_common.c.
Bulk- fallback                   - just-invoking-callbacks
  1 -  57 cycles(tsc) 14.500 ns  -  64 cycles(tsc) 16.121 ns
  2 -  51 cycles(tsc) 12.760 ns  -  53 cycles(tsc) 13.422 ns
  3 -  49 cycles(tsc) 12.345 ns  -  51 cycles(tsc) 12.855 ns
  4 -  48 cycles(tsc) 12.110 ns  -  49 cycles(tsc) 12.494 ns
  8 -  46 cycles(tsc) 11.596 ns  -  47 cycles(tsc) 11.768 ns
 16 -  45 cycles(tsc) 11.357 ns  -  45 cycles(tsc) 11.459 ns
 30 -  86 cycles(tsc) 21.622 ns  -  86 cycles(tsc) 21.639 ns
 32 -  83 cycles(tsc) 20.838 ns  -  83 cycles(tsc) 20.849 ns
 34 -  90 cycles(tsc) 22.509 ns  -  90 cycles(tsc) 22.516 ns
 48 -  98 cycles(tsc) 24.692 ns  -  98 cycles(tsc) 24.660 ns
 64 -  99 cycles(tsc) 24.775 ns  -  99 cycles(tsc) 24.848 ns
128 - 105 cycles(tsc) 26.305 ns  - 104 cycles(tsc) 26.065 ns
158 - 104 cycles(tsc) 26.214 ns  - 104 cycles(tsc) 26.139 ns
250 - 105 cycles(tsc) 26.360 ns  - 105 cycles(tsc) 26.309 ns
Measurements clearly show that the extra function call overhead in
kmem_cache_{free,alloc}_bulk is measurable.  Why don't we make
__kmem_cache_{free,alloc}_bulk inline?
[1] https://github.com/netoptimizer/prototype-kernel/blob/b4688559b/kernel/mm/slab_bulk_test01.c#L80
 include/linux/slab.h |   10 ++++++++++
 mm/slab.c            |   13 +++++++++++++
 mm/slab.h            |    9 +++++++++
 mm/slab_common.c     |   23 +++++++++++++++++++++++
 mm/slob.c            |   13 +++++++++++++
 mm/slub.c            |   14 ++++++++++++++
 6 files changed, 82 insertions(+)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index ffd24c830151..5db59c950ef7 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -290,6 +290,16 @@ void *__kmalloc(size_t size, gfp_t flags);
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags);
 void kmem_cache_free(struct kmem_cache *, void *);
 
+/*
+ * Bulk allocation and freeing operations. These are accelerated in an
+ * allocator specific way to avoid taking locks repeatedly or building
+ * metadata structures unnecessarily.
+ *
+ * Note that interrupts must be enabled when calling these functions.
+ */
+void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
+bool kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node);
 void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
diff --git a/mm/slab.c b/mm/slab.c
index 7eb38dd1cefa..8d4edc4230db 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3415,6 +3415,19 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	__kmem_cache_free_bulk(s, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_free_bulk);
+
+bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+								void **p)
+{
+	return __kmem_cache_alloc_bulk(s, flags, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+
 #ifdef CONFIG_TRACING
 void *
 kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
diff --git a/mm/slab.h b/mm/slab.h
index 4c3ac12dd644..6a427a74cca5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -162,6 +162,15 @@ void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
 ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 		       size_t count, loff_t *ppos);
 
+/*
+ * Generic implementation of bulk operations
+ * These are useful for situations in which the allocator cannot
+ * perform optimizations. In that case segments of the object list
+ * may be allocated or freed using these operations.
+ */
+void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
+bool __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * Iterate over all memcg caches of the given root cache. The caller must hold
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 999bb3424d44..f8acc2bdb88b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -105,6 +105,29 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size)
 }
 #endif
 
+void __kmem_cache_free_bulk(struct kmem_cache *s, size_t nr, void **p)
+{
+	size_t i;
+
+	for (i = 0; i < nr; i++)
+		kmem_cache_free(s, p[i]);
+}
+
+bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
+								void **p)
+{
+	size_t i;
+
+	for (i = 0; i < nr; i++) {
+		void *x = p[i] = kmem_cache_alloc(s, flags);
+		if (!x) {
+			__kmem_cache_free_bulk(s, i, p);
+			return false;
+		}
+	}
+	return true;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 void slab_init_memcg_params(struct kmem_cache *s)
 {
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65019c7..165bbd3cd606 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -611,6 +611,19 @@ void kmem_cache_free(struct kmem_cache *c, void *b)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	__kmem_cache_free_bulk(s, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_free_bulk);
+
+bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+								void **p)
+{
+	return __kmem_cache_alloc_bulk(s, flags, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+
 int __kmem_cache_shutdown(struct kmem_cache *c)
 {
 	/* No way to check for remaining objects */
diff --git a/mm/slub.c b/mm/slub.c
index 41624ccabc63..ac5a196d5ea5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2750,6 +2750,20 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	__kmem_cache_free_bulk(s, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_free_bulk);
+
+bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+								void **p)
+{
+	return __kmem_cache_alloc_bulk(s, flags, size, p);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+
+
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH V2 3/6] slub bulk alloc: extract objects from the per cpu slab
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
  2015-06-17 14:27 ` [PATCH V2 1/6] slub: fix spelling succedd to succeed Jesper Dangaard Brouer
  2015-06-17 14:27 ` [PATCH V2 2/6] slab: infrastructure for bulk object allocation and freeing Jesper Dangaard Brouer
@ 2015-06-17 14:28 ` Jesper Dangaard Brouer
  2015-06-17 14:28 ` [PATCH V2 4/6] slub: improve bulk alloc strategy Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:28 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
First piece: acceleration of retrieval of per cpu objects
If we are allocating lots of objects then it is advantageous to disable
interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
faster.
Note that we cannot do the fast operation if debugging is enabled, because
we would have to add extra code to do all the debugging checks.  And it
would not be fast anyway.
Note also that requiring interrupts to be enabled on entry avoids
having to save and restore the processor flags.
Allocate as many objects as possible in the fast way and then fall back to
the generic implementation for the rest of the objects.
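The "fast way" amounts to popping objects off a singly linked freelist
in which each free object stores the pointer to the next free object.
The following userspace sketch shows only that part; it assumes the
link lives in the first word of the object (SLUB uses a per-cache
offset), and struct cpu_slab, slab_init and drain_freelist are
hypothetical names.  There is no IRQ handling here; in the kernel this
loop runs with local interrupts disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified per-CPU state: just the head of the free object list */
struct cpu_slab {
	void *freelist;
};

/* Assumption: the next-free pointer is stored in the object's first word */
static void *get_next(void *object)
{
	return *(void **)object;
}

static void set_next(void *object, void *next)
{
	*(void **)object = next;
}

/* Thread 'count' objects of 'obj_size' bytes in 'mem' onto the freelist */
static void slab_init(struct cpu_slab *c, void *mem, size_t obj_size,
		      size_t count)
{
	size_t i;

	assert(obj_size >= sizeof(void *));
	c->freelist = NULL;
	for (i = count; i-- > 0; ) {
		void *obj = (char *)mem + i * obj_size;

		set_next(obj, c->freelist);
		c->freelist = obj;
	}
}

/*
 * Take up to 'nr' objects from the freelist.  Returns how many were
 * taken; the caller obtains the remainder some other way.
 */
static size_t drain_freelist(struct cpu_slab *c, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		void *object = c->freelist;

		if (!object)
			break;
		c->freelist = get_next(object);
		p[i] = object;
	}
	return i;
}
```

If the freelist holds four objects and six are requested,
drain_freelist() returns 4 and the remaining two slots are left for the
fallback allocation.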
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
V2:
 - Merged several patches into this
 - Basically rewritten entire function...
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns
Bulk- fallback                   - this-patch
  1 -  57 cycles(tsc) 14.432 ns  -  48 cycles(tsc) 12.155 ns  improved 15.8%
  2 -  50 cycles(tsc) 12.746 ns  -  37 cycles(tsc)  9.390 ns  improved 26.0%
  3 -  48 cycles(tsc) 12.180 ns  -  33 cycles(tsc)  8.417 ns  improved 31.2%
  4 -  48 cycles(tsc) 12.015 ns  -  32 cycles(tsc)  8.045 ns  improved 33.3%
  8 -  46 cycles(tsc) 11.526 ns  -  30 cycles(tsc)  7.699 ns  improved 34.8%
 16 -  45 cycles(tsc) 11.418 ns  -  32 cycles(tsc)  8.205 ns  improved 28.9%
 30 -  80 cycles(tsc) 20.246 ns  -  73 cycles(tsc) 18.328 ns  improved  8.8%
 32 -  79 cycles(tsc) 19.946 ns  -  72 cycles(tsc) 18.208 ns  improved  8.9%
 34 -  78 cycles(tsc) 19.659 ns  -  71 cycles(tsc) 17.987 ns  improved  9.0%
 48 -  86 cycles(tsc) 21.516 ns  -  82 cycles(tsc) 20.566 ns  improved  4.7%
 64 -  93 cycles(tsc) 23.423 ns  -  89 cycles(tsc) 22.480 ns  improved  4.3%
128 - 100 cycles(tsc) 25.170 ns  -  99 cycles(tsc) 24.871 ns  improved  1.0%
158 - 102 cycles(tsc) 25.549 ns  - 101 cycles(tsc) 25.375 ns  improved  1.0%
250 - 101 cycles(tsc) 25.344 ns  - 100 cycles(tsc) 25.182 ns  improved  1.0%
 mm/slub.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index ac5a196d5ea5..a92fdec57237 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2750,16 +2750,61 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+/* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	__kmem_cache_free_bulk(s, size, p);
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 
+/* Note that interrupts must be enabled when calling this function. */
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
-								void **p)
+			   void **p)
 {
-	return __kmem_cache_alloc_bulk(s, flags, size, p);
+	struct kmem_cache_cpu *c;
+	int i;
+
+	/* Debugging fallback to generic bulk */
+	if (kmem_cache_debug(s))
+		return __kmem_cache_alloc_bulk(s, flags, size, p);
+
+	/*
+	 * Drain objects in the per cpu slab, while disabling local
+	 * IRQs, which protects against PREEMPT and interrupts
+	 * handlers invoking normal fastpath.
+	 */
+	local_irq_disable();
+	c = this_cpu_ptr(s->cpu_slab);
+
+	for (i = 0; i < size; i++) {
+		void *object = c->freelist;
+
+		if (!object)
+			break;
+
+		c->freelist = get_freepointer(s, object);
+		p[i] = object;
+	}
+	c->tid = next_tid(c->tid);
+	local_irq_enable();
+
+	/* Clear memory outside IRQ disabled fastpath loop */
+	if (unlikely(flags & __GFP_ZERO)) {
+		int j;
+
+		for (j = 0; j < i; j++)
+			memset(p[j], 0, s->object_size);
+	}
+
+	/* Fallback to single elem alloc */
+	for (; i < size; i++) {
+		void *x = p[i] = kmem_cache_alloc(s, flags);
+		if (unlikely(!x)) {
+			__kmem_cache_free_bulk(s, i, p);
+			return false;
+		}
+	}
+	return true;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH V2 4/6] slub: improve bulk alloc strategy
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2015-06-17 14:28 ` [PATCH V2 3/6] slub bulk alloc: extract objects from the per cpu slab Jesper Dangaard Brouer
@ 2015-06-17 14:28 ` Jesper Dangaard Brouer
  2015-06-17 14:29 ` [PATCH V2 5/6] slub: initial bulk free implementation Jesper Dangaard Brouer
  2015-06-17 14:29 ` [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls Jesper Dangaard Brouer
  5 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:28 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
Call slowpath __slab_alloc() from within the bulk loop, as the
side-effect of this call likely repopulates c->freelist.
Choose to reenable local IRQs while calling slowpath.
Saving some optimizations for later.  E.g. it is possible to
extract parts of __slab_alloc() and avoid the unnecessary and
expensive (37 cycles) local_irq_{save,restore}.  For now, be
happy calling __slab_alloc(): this lowers the icache impact of this
function, and I don't have to worry about correctness.
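The control flow can be sketched in userspace as follows.  This is only
an illustration under simplifying assumptions: slow_alloc() stands in
for __slab_alloc() by handing out the first object of a fresh slab and
threading the rest onto the freelist, there is no IRQ or per-CPU
handling, and on failure the objects are pushed back onto the freelist
rather than freed.  All names (struct slab_pool, slow_alloc,
bulk_alloc_refill) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cpu_slab {
	void *freelist;		/* next-free pointer lives in each object's first word */
};

/* A trivial source of fresh slabs carved out of one contiguous buffer */
struct slab_pool {
	char *mem;
	size_t obj_size;
	size_t objs_per_slab;
	size_t slabs_left;
};

/*
 * Stand-in for the slowpath: returns one object and, as a side effect,
 * repopulates c->freelist with the other objects of a new slab.
 */
static void *slow_alloc(struct cpu_slab *c, struct slab_pool *pool)
{
	char *slab;
	size_t i;

	assert(pool->objs_per_slab >= 1);
	if (pool->slabs_left == 0)
		return NULL;
	slab = pool->mem;
	pool->mem += pool->obj_size * pool->objs_per_slab;
	pool->slabs_left--;

	/* Objects 1..n-1 go onto the freelist; object 0 is returned */
	for (i = pool->objs_per_slab; i-- > 1; ) {
		void *obj = slab + i * pool->obj_size;

		*(void **)obj = c->freelist;
		c->freelist = obj;
	}
	return slab;
}

static bool bulk_alloc_refill(struct cpu_slab *c, struct slab_pool *pool,
			      size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		void *object = c->freelist;

		if (!object) {
			/* Freelist empty: take the slowpath, then keep looping */
			p[i] = slow_alloc(c, pool);
			if (!p[i]) {
				/* Undo: return what we took to the freelist */
				while (i-- > 0) {
					*(void **)p[i] = c->freelist;
					c->freelist = p[i];
				}
				return false;
			}
			continue;
		}
		c->freelist = *(void **)object;
		p[i] = object;
	}
	return true;
}
```

The point of the structure is that one slowpath call pays for many
subsequent fast iterations, instead of every remaining object going
through a full single-object allocation.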
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
V2:
 - remove update of "tid" before call to __slab_alloc()
   not necessary as code-path does not modify per cpu information
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
Bulk- fallback                   - this-patch
  1 -  58 cycles(tsc) 14.516 ns  -  49 cycles(tsc) 12.459 ns  improved 15.5%
  2 -  51 cycles(tsc) 12.930 ns  -  38 cycles(tsc)  9.605 ns  improved 25.5%
  3 -  49 cycles(tsc) 12.274 ns  -  34 cycles(tsc)  8.525 ns  improved 30.6%
  4 -  48 cycles(tsc) 12.058 ns  -  32 cycles(tsc)  8.036 ns  improved 33.3%
  8 -  46 cycles(tsc) 11.609 ns  -  31 cycles(tsc)  7.756 ns  improved 32.6%
 16 -  45 cycles(tsc) 11.451 ns  -  32 cycles(tsc)  8.148 ns  improved 28.9%
 30 -  79 cycles(tsc) 19.865 ns  -  68 cycles(tsc) 17.164 ns  improved 13.9%
 32 -  76 cycles(tsc) 19.212 ns  -  66 cycles(tsc) 16.584 ns  improved 13.2%
 34 -  74 cycles(tsc) 18.600 ns  -  63 cycles(tsc) 15.954 ns  improved 14.9%
 48 -  88 cycles(tsc) 22.092 ns  -  77 cycles(tsc) 19.373 ns  improved 12.5%
 64 -  80 cycles(tsc) 20.043 ns  -  68 cycles(tsc) 17.188 ns  improved 15.0%
128 -  99 cycles(tsc) 24.818 ns  -  89 cycles(tsc) 22.404 ns  improved 10.1%
158 -  99 cycles(tsc) 24.977 ns  -  92 cycles(tsc) 23.089 ns  improved  7.1%
250 - 106 cycles(tsc) 26.552 ns  -  99 cycles(tsc) 24.785 ns  improved  6.6%
 mm/slub.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index a92fdec57237..02c33bacd3a6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2779,8 +2779,22 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	for (i = 0; i < size; i++) {
 		void *object = c->freelist;
 
-		if (!object)
-			break;
+		if (unlikely(!object)) {
+			local_irq_enable();
+			/*
+			 * Invoking the slow path likely has the side-effect
+			 * of re-populating the per CPU c->freelist
+			 */
+			p[i] = __slab_alloc(s, flags, NUMA_NO_NODE,
+					    _RET_IP_, c);
+			if (unlikely(!p[i])) {
+				__kmem_cache_free_bulk(s, i, p);
+				return false;
+			}
+			local_irq_disable();
+			c = this_cpu_ptr(s->cpu_slab);
+			continue; /* goto for-loop */
+		}
 
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
@@ -2796,14 +2810,6 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			memset(p[j], 0, s->object_size);
 	}
 
-	/* Fallback to single elem alloc */
-	for (; i < size; i++) {
-		void *x = p[i] = kmem_cache_alloc(s, flags);
-		if (unlikely(!x)) {
-			__kmem_cache_free_bulk(s, i, p);
-			return false;
-		}
-	}
 	return true;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH V2 5/6] slub: initial bulk free implementation
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2015-06-17 14:28 ` [PATCH V2 4/6] slub: improve bulk alloc strategy Jesper Dangaard Brouer
@ 2015-06-17 14:29 ` Jesper Dangaard Brouer
  2015-06-17 14:29 ` [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls Jesper Dangaard Brouer
  5 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:29 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
This implements a SLUB-specific kmem_cache_free_bulk().  The SLUB
allocator now has both bulk alloc and bulk free implemented.
Choose to reenable local IRQs while calling slowpath __slab_free().
In worst case, where all objects hit slowpath call, the performance
should still be faster than fallback function __kmem_cache_free_bulk(),
because local_irq_{disable+enable} is very fast (7-cycles), while the
fallback invokes this_cpu_cmpxchg() which is slightly slower
(9-cycles). Nitpicking, this should be faster for N>=4, due to the
entry cost of local_irq_{disable+enable}.
Do notice that the save+restore variant is very expensive; avoiding it
is key to why this optimization works.
CPU: i7-4790K CPU @ 4.00GHz
 * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
 * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns
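One way to read the N>=4 remark, using only the cycle counts quoted
above: in the worst case every object takes the slowpath, so the bulk
call pays one local_irq_{disable,enable} pair on entry/exit plus one
pair per object, while the fallback pays one this_cpu_cmpxchg() per
object.  The cost model below is an assumption made for illustration
(it ignores the cost of __slab_free() itself, which both sides pay),
and the names are hypothetical.

```c
#include <assert.h>

enum {
	IRQ_PAIR_CYCLES = 7,	/* local_irq_{disable,enable}, as measured above */
	CMPXCHG_CYCLES  = 9	/* this_cpu_cmpxchg(), as quoted above */
};

/* Worst case: entry/exit pair plus one enable/disable pair per object */
static unsigned int bulk_overhead(unsigned int n)
{
	assert(n > 0);
	return IRQ_PAIR_CYCLES + n * IRQ_PAIR_CYCLES;
}

/* Fallback: one this_cpu_cmpxchg() per object */
static unsigned int fallback_overhead(unsigned int n)
{
	assert(n > 0);
	return n * CMPXCHG_CYCLES;
}

/* Smallest bulk size at which the bulk overhead is no longer higher */
static unsigned int break_even(void)
{
	unsigned int n = 1;

	while (bulk_overhead(n) > fallback_overhead(n))
		n++;
	return n;
}
```

Under this model 7 + 7N <= 9N first holds at N = 4 (35 vs 36 cycles),
which agrees with the estimate in the text.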
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
V2:
 - Add BUG_ON(!object)
 - No support for kmem_cache_debug()
Measurements on CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns
Bulk- fallback                   - this-patch
  1 -  58 cycles(tsc) 14.542 ns  -  43 cycles(tsc) 10.811 ns  improved 25.9%
  2 -  50 cycles(tsc) 12.659 ns  -  27 cycles(tsc)  6.867 ns  improved 46.0%
  3 -  48 cycles(tsc) 12.168 ns  -  21 cycles(tsc)  5.496 ns  improved 56.2%
  4 -  47 cycles(tsc) 11.987 ns  -  24 cycles(tsc)  6.038 ns  improved 48.9%
  8 -  46 cycles(tsc) 11.518 ns  -  17 cycles(tsc)  4.280 ns  improved 63.0%
 16 -  45 cycles(tsc) 11.366 ns  -  17 cycles(tsc)  4.483 ns  improved 62.2%
 30 -  45 cycles(tsc) 11.433 ns  -  18 cycles(tsc)  4.531 ns  improved 60.0%
 32 -  75 cycles(tsc) 18.983 ns  -  58 cycles(tsc) 14.586 ns  improved 22.7%
 34 -  71 cycles(tsc) 17.940 ns  -  53 cycles(tsc) 13.391 ns  improved 25.4%
 48 -  80 cycles(tsc) 20.077 ns  -  65 cycles(tsc) 16.268 ns  improved 18.8%
 64 -  71 cycles(tsc) 17.799 ns  -  53 cycles(tsc) 13.440 ns  improved 25.4%
128 -  91 cycles(tsc) 22.980 ns  -  79 cycles(tsc) 19.899 ns  improved 13.2%
158 - 100 cycles(tsc) 25.241 ns  -  90 cycles(tsc) 22.732 ns  improved 10.0%
250 - 102 cycles(tsc) 25.583 ns  -  95 cycles(tsc) 23.916 ns  improved  6.9%
 mm/slub.c |   34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 02c33bacd3a6..6ac5921b3389 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2753,7 +2753,39 @@ EXPORT_SYMBOL(kmem_cache_free);
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-	__kmem_cache_free_bulk(s, size, p);
+	struct kmem_cache_cpu *c;
+	struct page *page;
+	int i;
+
+	/* Debugging fallback to generic bulk */
+	if (kmem_cache_debug(s))
+		return __kmem_cache_free_bulk(s, size, p);
+
+	local_irq_disable();
+	c = this_cpu_ptr(s->cpu_slab);
+
+	for (i = 0; i < size; i++) {
+		void *object = p[i];
+
+		BUG_ON(!object);
+		page = virt_to_head_page(object);
+		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
+
+		if (c->page == page) {
+			/* Fastpath: local CPU free */
+			set_freepointer(s, object, c->freelist);
+			c->freelist = object;
+		} else {
+			c->tid = next_tid(c->tid);
+			local_irq_enable();
+			/* Slowpath: overhead locked cmpxchg_double_slab */
+			__slab_free(s, page, object, _RET_IP_);
+			local_irq_disable();
+			c = this_cpu_ptr(s->cpu_slab);
+		}
+	}
+	c->tid = next_tid(c->tid);
+	local_irq_enable();
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls
  2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2015-06-17 14:29 ` [PATCH V2 5/6] slub: initial bulk free implementation Jesper Dangaard Brouer
@ 2015-06-17 14:29 ` Jesper Dangaard Brouer
  2015-06-17 15:08   ` Christoph Lameter
  5 siblings, 1 reply; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 14:29 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Jesper Dangaard Brouer
Per request of Joonsoo Kim, this adds kmem debug support.
I've tested that when debugging is disabled, there is almost
no performance impact, as this code basically gets removed by the
compiler.
I need some guidance in enabling and testing this.
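The "removed by the compiler" effect relies on the debug test being a
compile-time constant when debugging is configured out.  A minimal
userspace sketch of that pattern follows; SKETCH_DEBUG plays the role
of the kernel config option, and struct cache, cache_debug, free_hook
and bulk_free_hooks are hypothetical names, not the SLUB ones.

```c
#include <assert.h>
#include <stddef.h>

#ifndef SKETCH_DEBUG
#define SKETCH_DEBUG 0		/* build with -DSKETCH_DEBUG=1 to enable hooks */
#endif

#define CACHE_DEBUG_FLAG 0x1UL

struct cache {
	unsigned long flags;
	unsigned long hook_calls;	/* counts debug work actually done */
};

/* Constant 0 when debugging is compiled out, so callers' branches fold away */
static inline int cache_debug(const struct cache *s)
{
#if SKETCH_DEBUG
	return (s->flags & CACHE_DEBUG_FLAG) != 0;
#else
	(void)s;
	return 0;
#endif
}

/* Per-object hook: does work only for caches with debugging enabled */
static inline void free_hook(struct cache *s, void *object)
{
	(void)object;
	if (cache_debug(s))
		s->hook_calls++;
}

/* Calling the hook per object costs nothing when cache_debug() is constant 0 */
static void bulk_free_hooks(struct cache *s, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++)
		free_hook(s, p[i]);
}
```

In the default build the loop body is empty after inlining, so an
optimizing compiler can drop it; whether that fully holds for the real
hooks is what the measurements below are meant to show.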
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
Measurements with disabled debugging:
bulk- PREVIOUS                  - THIS-PATCH
  1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
  2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
  3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
  4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
  8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
 16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
 30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
 32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
 34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
 48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
 64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
 mm/slub.c |   28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6ac5921b3389..cb19d5c0e26c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2757,10 +2757,6 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	struct page *page;
 	int i;
 
-	/* Debugging fallback to generic bulk */
-	if (kmem_cache_debug(s))
-		return __kmem_cache_free_bulk(s, size, p);
-
 	local_irq_disable();
 	c = this_cpu_ptr(s->cpu_slab);
 
@@ -2768,8 +2764,13 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 		void *object = p[i];
 
 		BUG_ON(!object);
+		/* kmem cache debug support */
+		s = cache_from_obj(s, object);
+		if (unlikely(!s))
+			goto exit;
+		slab_free_hook(s, object);
+
 		page = virt_to_head_page(object);
-		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
 
 		if (c->page == page) {
 			/* Fastpath: local CPU free */
@@ -2784,6 +2785,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c = this_cpu_ptr(s->cpu_slab);
 		}
 	}
+exit:
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
 }
@@ -2796,10 +2798,6 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	struct kmem_cache_cpu *c;
 	int i;
 
-	/* Debugging fallback to generic bulk */
-	if (kmem_cache_debug(s))
-		return __kmem_cache_alloc_bulk(s, flags, size, p);
-
 	/*
 	 * Drain objects in the per cpu slab, while disabling local
 	 * IRQs, which protects against PREEMPT and interrupts
@@ -2828,8 +2826,20 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			continue; /* goto for-loop */
 		}
 
+		/* kmem_cache debug support */
+		s = slab_pre_alloc_hook(s, flags);
+		if (unlikely(!s)) {
+			__kmem_cache_free_bulk(s, i, p);
+			c->tid = next_tid(c->tid);
+			local_irq_enable();
+			return false;
+		}
+
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
+
+		/* kmem_cache debug support */
+		slab_post_alloc_hook(s, flags, object);
 	}
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls
  2015-06-17 14:29 ` [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls Jesper Dangaard Brouer
@ 2015-06-17 15:08   ` Christoph Lameter
  2015-06-17 15:24     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2015-06-17 15:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: linux-mm, Andrew Morton, Joonsoo Kim
> Per request of Joonsoo Kim adding kmem debug support.
> bulk- PREVIOUS                  - THIS-PATCH
>   1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
>   2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
>   3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
>   4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
>   8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
>  16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
>  30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
>  32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
>  34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
>  48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
>  64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
> 128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
> 158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
> 250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
Hmmm.... Can we afford these regressions?
* Re: [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls
  2015-06-17 15:08   ` Christoph Lameter
@ 2015-06-17 15:24     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-17 15:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, Joonsoo Kim, brouer
On Wed, 17 Jun 2015 10:08:28 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> > Per request of Joonsoo Kim adding kmem debug support.
> 
> > bulk- PREVIOUS                  - THIS-PATCH
> >   1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
> >   2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
> >   3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
> >   4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
> >   8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
> >  16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
> >  30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
> >  32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
> >  34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
> >  48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
> >  64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
> > 128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
> > 158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
> > 250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
> 
> Hmmm.... Can we afford these regressions?
Do notice that the "regression" is mostly within 1 cycle, which I would
not call a regression, given the accuracy of these measurements.
The page-border cases 32, 34, 48 and 64 cannot be used to assess this.
We could look at the assembler code to see if we can spot the extra
instruction that does not get optimized away.
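Editorial sketch: the "improved" column in the quoted table is consistent with (PREVIOUS - THIS-PATCH) / PREVIOUS computed on the whole-cycle counts, so a negative value means the patch is slower. The helper below reproduces that arithmetic; the function name `improved_pct` is hypothetical, and the formula is inferred from the table rows rather than stated in the thread.

```c
#include <assert.h>

/*
 * Relative improvement in percent, computed from cycle counts.
 * Positive: THIS-PATCH is faster than PREVIOUS. Negative: slower.
 */
static double improved_pct(double prev_cycles, double cur_cycles)
{
	assert(prev_cycles > 0.0);
	return (prev_cycles - cur_cycles) / prev_cycles * 100.0;
}
```

For the bulk-1 row, improved_pct(43, 44) is about -2.3, and for the bulk-8 row improved_pct(17, 18) is about -5.9. With counts this small, a single cycle of difference moves the column by several percent, which is why changes of one cycle are hard to tell apart from measurement noise.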
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Thread overview: 9+ messages
2015-06-17 14:26 [PATCH V2 0/6] slub: bulk alloc and free for slub allocator Jesper Dangaard Brouer
2015-06-17 14:27 ` [PATCH V2 1/6] slub: fix spelling succedd to succeed Jesper Dangaard Brouer
2015-06-17 14:27 ` [PATCH V2 2/6] slab: infrastructure for bulk object allocation and freeing Jesper Dangaard Brouer
2015-06-17 14:28 ` [PATCH V2 3/6] slub bulk alloc: extract objects from the per cpu slab Jesper Dangaard Brouer
2015-06-17 14:28 ` [PATCH V2 4/6] slub: improve bulk alloc strategy Jesper Dangaard Brouer
2015-06-17 14:29 ` [PATCH V2 5/6] slub: initial bulk free implementation Jesper Dangaard Brouer
2015-06-17 14:29 ` [PATCH V2 6/6] slub: add support for kmem_cache_debug in bulk calls Jesper Dangaard Brouer
2015-06-17 15:08   ` Christoph Lameter
2015-06-17 15:24     ` Jesper Dangaard Brouer