linux-mm.kvack.org archive mirror
* [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT
@ 2015-06-04 10:31 Jesper Dangaard Brouer
  2015-06-05  2:37 ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-04 10:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesper Dangaard Brouer, Joonsoo Kim, Alexander Duyck, linux-mm,
	netdev

This patch improves performance of the SLUB allocator fastpath by 38%, by
avoiding the call to this_cpu_cmpxchg_double() for NO-PREEMPT kernels.

Reviewers, please point out why this change is wrong, as such a large
improvement should not be possible ;-)

My primary motivation for this patch is to understand and
microbenchmark the MM layer of the kernel, due to increasing demand
from the networking stack.

This "microbenchmark" is merely to demonstrate the cost of the
instruction CMPXCHG16B (without LOCK prefix).

My microbenchmark is available on GitHub[1] (it reuses "qmempool_bench").

Fastpath-reuse cost (alloc+free) on a CPU E5-2695:
 * 47 cycles(tsc) - 18.948 ns  (normal with this_cpu_cmpxchg_double)
 * 29 cycles(tsc) - 11.791 ns  (with patch)

Thus, the difference isolates the cost of CMPXCHG16B:
 * Total saved: 18 cycles - 7.157 ns
 * For the two CMPXCHG16B (alloc+free): 9 cycles - 3.579 ns saved per instruction
 * http://instlatx64.atw.hu/ lists the cost of CMPXCHG16B as 9 cycles

This also shows that the cost of this_cpu_cmpxchg_double() in SLUB is
approximately 38% of the fast-path cost.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c
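
For reference, the hot loop being timed is essentially the following
(a minimal sketch, assuming a kernel-module context with <linux/slab.h>
and <asm/timex.h>; the function name and reporting are illustrative,
not the exact qmempool_bench code):

/* Sketch: time N back-to-back alloc+free pairs.  The immediate free
 * puts the object back on the per-cpu freelist, so the following
 * alloc stays on the fastpath.
 */
static int bench_fastpath_reuse(struct kmem_cache *cache, unsigned int loops)
{
	cycles_t start, stop;
	unsigned int i;
	void *obj;

	start = get_cycles();
	for (i = 0; i < loops; i++) {
		obj = kmem_cache_alloc(cache, GFP_ATOMIC);
		if (unlikely(!obj))
			return -ENOMEM;
		kmem_cache_free(cache, obj);
	}
	stop = get_cycles();

	pr_info("alloc+free: %llu cycles(tsc) per iteration\n",
		(unsigned long long)(stop - start) / loops);
	return 0;
}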

The cunning reviewer will also want to know the cost of disabling
interrupts on this CPU. Here it is interesting to see that the
save/restore variant is significantly more expensive:

Cost of local IRQ toggling (CPU E5-2695):
 *  local_irq_{disable,enable}:  7 cycles(tsc) -  2.861 ns
 *  local_irq_{save,restore}  : 37 cycles(tsc) - 14.846 ns

Even with the additional overhead of local_irq_{disable,enable}, there
would still be a saving of 11 cycles (out of 47), i.e. 23%. The two
IRQ-toggling variants are illustrated below.
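
The difference presumably comes from local_irq_save() also having to
read and stash the flags register, while local_irq_disable() is a
plain CLI on x86. Usage-wise (an illustration, not code from the patch):

	unsigned long flags;

	/* Cheap variant: only safe when IRQs are known to be enabled,
	 * since local_irq_enable() unconditionally re-enables them.
	 */
	local_irq_disable();
	/* ... critical section ... */
	local_irq_enable();

	/* Expensive variant: saves and restores the previous IRQ state,
	 * so it is safe in contexts that may already run with IRQs off.
	 */
	local_irq_save(flags);
	/* ... critical section ... */
	local_irq_restore(flags);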
---

 mm/slub.c |   52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..b31991f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2489,13 +2489,32 @@ redo:
 		 * against code executing on this cpu *not* from access by
 		 * other cpus.
 		 */
-		if (unlikely(!this_cpu_cmpxchg_double(
-				s->cpu_slab->freelist, s->cpu_slab->tid,
-				object, tid,
-				next_object, next_tid(tid)))) {
-
-			note_cmpxchg_failure("slab_alloc", s, tid);
-			goto redo;
+		if (IS_ENABLED(CONFIG_PREEMPT)) {
+			if (unlikely(!this_cpu_cmpxchg_double(
+					s->cpu_slab->freelist, s->cpu_slab->tid,
+					object, tid,
+					next_object, next_tid(tid)))) {
+
+				note_cmpxchg_failure("slab_alloc", s, tid);
+				goto redo;
+			}
+		} else {
+			/* HACK - on NO-PREEMPT the cmpxchg is not necessary(?) */
+			__this_cpu_write(s->cpu_slab->tid, next_tid(tid));
+			__this_cpu_write(s->cpu_slab->freelist, next_object);
+			/*
+			 * Q: What happens in case this is called from an interrupt handler?
+			 *
+			 * If we need to disable (local) IRQs, then most of the
+			 * saving is lost; e.g. the local_irq_{save,restore}
+			 * variant is too costly.
+			 *
+			 * Saved (alloc+free): 18 cycles - 7.157 ns
+			 *
+			 * Cost of (CPU E5-2695):
+			 *  local_irq_{disable,enable}:  7 cycles(tsc) -  2.861 ns
+			 *  local_irq_{save,restore}  : 37 cycles(tsc) - 14.846 ns
+			 */
 		}
 		prefetch_freepointer(s, next_object);
 		stat(s, ALLOC_FASTPATH);
@@ -2726,14 +2745,21 @@ redo:
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
 
-		if (unlikely(!this_cpu_cmpxchg_double(
-				s->cpu_slab->freelist, s->cpu_slab->tid,
-				c->freelist, tid,
-				object, next_tid(tid)))) {
+		if (IS_ENABLED(CONFIG_PREEMPT)) {
+			if (unlikely(!this_cpu_cmpxchg_double(
+					s->cpu_slab->freelist, s->cpu_slab->tid,
+					c->freelist, tid,
+					object, next_tid(tid)))) {
 
-			note_cmpxchg_failure("slab_free", s, tid);
-			goto redo;
+				note_cmpxchg_failure("slab_free", s, tid);
+				goto redo;
+			}
+		} else {
+			/* HACK - on NO-PREEMPT the cmpxchg is not necessary(?) */
+			__this_cpu_write(s->cpu_slab->tid, next_tid(tid));
+			__this_cpu_write(s->cpu_slab->freelist, object);
 		}
+
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);


* Re: [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT
  2015-06-04 10:31 [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT Jesper Dangaard Brouer
@ 2015-06-05  2:37 ` Eric Dumazet
  2015-06-08  9:23   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2015-06-05  2:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Lameter, Joonsoo Kim, Alexander Duyck, linux-mm, netdev

On Thu, 2015-06-04 at 12:31 +0200, Jesper Dangaard Brouer wrote:
> This patch improves performance of the SLUB allocator fastpath by 38%, by
> avoiding the call to this_cpu_cmpxchg_double() for NO-PREEMPT kernels.
> 
> Reviewers, please point out why this change is wrong, as such a large
> improvement should not be possible ;-)

I am not sure if anyone has already answered, but the cmpxchg_double()
is needed to avoid the ABA problem.

This is the whole point of using tid _and_ freelist.

Preemption is not the only thing that could happen here; think of
interrupts.
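
A concrete timeline of the ABA hazard the tid guards against (my
illustration, not from the original mail):

/* Per-cpu freelist: A -> B -> C, tid = T
 *
 * A task on the alloc fastpath reads object = A, next_object = B and
 * tid = T, then an interrupt hits before its cmpxchg:
 *
 *   IRQ handler: allocates A      freelist: B -> C
 *   IRQ handler: allocates B      freelist: C
 *   IRQ handler: frees A          freelist: A -> C
 *
 * Back in the task, the freelist head is A again, so a cmpxchg on the
 * freelist pointer alone would succeed and install the stale
 * next_object B, corrupting the freelist.  Each IRQ-handler
 * transaction bumped the tid, so the double-word cmpxchg on
 * (freelist, tid) fails and the fastpath retries.
 */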




* Re: [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT
  2015-06-05  2:37 ` Eric Dumazet
@ 2015-06-08  9:23   ` Jesper Dangaard Brouer
  2015-06-08  9:39     ` Christoph Lameter
  0 siblings, 1 reply; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-08  9:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Joonsoo Kim, Alexander Duyck, linux-mm, netdev,
	brouer

On Thu, 04 Jun 2015 19:37:57 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2015-06-04 at 12:31 +0200, Jesper Dangaard Brouer wrote:
> > This patch improves performance of the SLUB allocator fastpath by 38%, by
> > avoiding the call to this_cpu_cmpxchg_double() for NO-PREEMPT kernels.
> > 
> > Reviewers, please point out why this change is wrong, as such a large
> > improvement should not be possible ;-)
> 
> I am not sure if anyone has already answered, but the cmpxchg_double()
> is needed to avoid the ABA problem.
> 
> This is the whole point of using tid _and_ freelist.
> 
> Preemption is not the only thing that could happen here; think of
> interrupts.

Yes, I sort of already knew this.

My real question is whether disabling local interrupts is enough to avoid this.

And does local IRQ disabling also stop preemption?

These questions relate to this patch:
 http://ozlabs.org/~akpm/mmots/broken-out/slub-bulk-alloc-extract-objects-from-the-per-cpu-slab.patch

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT
  2015-06-08  9:23   ` Jesper Dangaard Brouer
@ 2015-06-08  9:39     ` Christoph Lameter
  2015-06-08  9:58       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Lameter @ 2015-06-08  9:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Joonsoo Kim, Alexander Duyck, linux-mm, netdev

On Mon, 8 Jun 2015, Jesper Dangaard Brouer wrote:

> My real question is whether disabling local interrupts is enough to avoid this.

Yes, the initial release of SLUB used interrupt disabling in the fast paths.

> And does local IRQ disabling also stop preemption?

Of course.


* Re: [RFC PATCH] slub: RFC: Improving SLUB performance by 38% on NO-PREEMPT
  2015-06-08  9:39     ` Christoph Lameter
@ 2015-06-08  9:58       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2015-06-08  9:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Joonsoo Kim, Alexander Duyck, linux-mm, netdev,
	brouer

On Mon, 8 Jun 2015 04:39:38 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Mon, 8 Jun 2015, Jesper Dangaard Brouer wrote:
> 
> > My real question is whether disabling local interrupts is enough to avoid this.
> 
> Yes, the initial release of SLUB used interrupt disabling in the fast paths.

Thanks for the confirmation.

For this code path we would need the save/restore variant, which is
more expensive than the local cmpxchg16b. In the case of bulking, we
should be able to use the less expensive local_irq_{disable,enable};
see the sketch after the cost numbers below.

Cost of local IRQ toggling (CPU E5-2695):
 *  local_irq_{disable,enable}:  7 cycles(tsc) -  2.861 ns
 *  local_irq_{save,restore}  : 37 cycles(tsc) - 14.846 ns
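
To make the bulking idea concrete, a rough sketch (my illustration,
not the actual patch; the function name is made up, get_freepointer()
is the existing static helper in mm/slub.c):

/* Illustration: with local IRQs off, the per-cpu freelist cannot
 * change underneath us, so plain loads/stores replace the cmpxchg16b
 * transaction, and a single local_irq_{disable,enable} pair (7 cycles)
 * is amortized over all nr objects.  A real version would also have
 * to advance c->tid; that is omitted here for brevity.
 */
static size_t example_bulk_alloc(struct kmem_cache *s, size_t nr, void **objs)
{
	struct kmem_cache_cpu *c;
	size_t i;

	local_irq_disable();
	c = this_cpu_ptr(s->cpu_slab);
	for (i = 0; i < nr && c->freelist; i++) {
		objs[i] = c->freelist;
		c->freelist = get_freepointer(s, objs[i]);
	}
	local_irq_enable();
	return i;	/* caller handles any shortfall via the slowpath */
}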

P.S. I'm back working on the bulking API...

> > And does local IRQ disabling also stop preemption?
> 
> Of course.

Thanks for confirming this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

