Re: [PATCH 3/3] slub: build detached freelist with look-ahead

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: linux-mm@kvack.org, Christoph Lameter <cl@linux.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>,
	brouer@redhat.com
Subject: Re: [PATCH 3/3] slub: build detached freelist with look-ahead
Date: Mon, 20 Jul 2015 23:28:17 +0200
Message-ID: <20150720232817.05f08663@redhat.com>
In-Reply-To: <20150720025415.GA21760@js1304-P5Q-DELUXE>

On Mon, 20 Jul 2015 11:54:15 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Thu, Jul 16, 2015 at 11:57:56AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Wed, 15 Jul 2015 18:02:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > Results:
> > [...]
> > > bulk- Fallback                  - Bulk API
> > >   1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
> > >   2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
> > >   3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
> > >   4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
> > >   8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
> > >  16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
> > >  30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
> > >  32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
> > >  34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
> > >  48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
> > >  64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
> > > 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
> > > 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
> > > 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%
> > 
> > 
> > Something interesting happens, when I'm tuning the SLAB/slub cache...
> > 
> > I was thinking what happens if I "give" the slub more per CPU partial
> > pages.  In my benchmark 250 is my "max" bulk working set.
> > 
> > Tuning SLAB/slub for 256 bytes object size, by tuning SLUB saying each
> > CPU partial should be allowed to contain 256 objects (cpu_partial).
> > 
> >  sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'
> > 
> > And adjusting 'min_partial' affects __slab_free() by avoiding removing
> > partial if node->nr_partial >= s->min_partial.  Thus, in our test
> > min_partial=9 result in keeping 9 pages 32 * 9 = 288 objects in the
> > 
> >  sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'
> >  sudo grep -H . /sys/kernel/slab/:t-0000256/*
> > 
> > First notice the normal fastpath is: 47 cycles(tsc) 11.894 ns
> > 
> > Patch03-TUNED-run01:
> > bulk-  Fallback                 - Bulk-API
> >   1 -  63 cycles(tsc) 15.866 ns - 46 cycles(tsc) 11.653 ns - improved 27.0%
> >   2 -  56 cycles(tsc) 14.137 ns - 28 cycles(tsc)  7.106 ns - improved 50.0%
> >   3 -  54 cycles(tsc) 13.623 ns - 23 cycles(tsc)  5.845 ns - improved 57.4%
> >   4 -  53 cycles(tsc) 13.345 ns - 21 cycles(tsc)  5.316 ns - improved 60.4%
> >   8 -  51 cycles(tsc) 12.960 ns - 20 cycles(tsc)  5.187 ns - improved 60.8%
> >  16 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.091 ns - improved 60.0%
> >  30 -  80 cycles(tsc) 20.153 ns - 28 cycles(tsc)  7.054 ns - improved 65.0%
> >  32 -  82 cycles(tsc) 20.621 ns - 33 cycles(tsc)  8.392 ns - improved 59.8%
> >  34 -  80 cycles(tsc) 20.125 ns - 32 cycles(tsc)  8.046 ns - improved 60.0%
> >  48 -  91 cycles(tsc) 22.887 ns - 30 cycles(tsc)  7.655 ns - improved 67.0%
> >  64 -  85 cycles(tsc) 21.362 ns - 36 cycles(tsc)  9.141 ns - improved 57.6%
> > 128 - 101 cycles(tsc) 25.481 ns - 33 cycles(tsc)  8.286 ns - improved 67.3%
> > 158 - 103 cycles(tsc) 25.909 ns - 36 cycles(tsc)  9.179 ns - improved 65.0%
> > 250 - 105 cycles(tsc) 26.481 ns - 39 cycles(tsc)  9.994 ns - improved 62.9%
> > 
> > Notice how ALL of the bulk sizes now are faster than the 47 cycles of
> > the normal slub fastpath.  This is amazing!
> > 
> > A little strangely, the tuning didn't seem to help the fallback version.
> 
> Hello,
> 
> Looks very nice.

Thanks :-)

> I have some questions about your benchmark and result.
> 
> 1. Does the slab is merged?
> - Your above result shows that fallback bulk for 30, 32 takes longer
>   than fallback bulk for 16. This is strange result because fallback
>   bulk allocation/free for 16, 30, 32 should happens only on cpu cache.

I guess it depends on how "used/full" the page is... some other
subsystem can hold on to objects...

>   If the slab is merged, you should turn off merging to get precise
>   result.

Yes, I think it is merged... how do I turn off merging?

Before adjusting/tuning the slab cache:

$ sudo grep -H . /sys/kernel/slab/:t-0000256/{cpu_partial,min_partial,order,objs_per_slab}
/sys/kernel/slab/:t-0000256/cpu_partial:13
/sys/kernel/slab/:t-0000256/min_partial:5
/sys/kernel/slab/:t-0000256/order:1
/sys/kernel/slab/:t-0000256/objs_per_slab:32
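
The sysfs values above can be related to the 250-object working set with a little arithmetic. The following is only a minimal sketch: it assumes 4 KiB pages and ignores any per-object metadata or debug padding (under those assumptions it reproduces the objs_per_slab:32 reported by sysfs). The function names are illustrative and are not kernel APIs.

```python
PAGE_SIZE = 4096    # assumption: 4 KiB pages
OBJECT_SIZE = 256   # from the cache name :t-0000256
ORDER = 1           # from /sys/kernel/slab/:t-0000256/order
WORKING_SET = 250   # largest bulk size used in the benchmark


def objs_per_slab(order: int, object_size: int, page_size: int = PAGE_SIZE) -> int:
    """Objects that fit in one slab of 2**order pages (no metadata assumed)."""
    return (page_size << order) // object_size


def partial_capacity(min_partial: int, per_slab: int) -> int:
    """Upper bound on objects held by min_partial slabs kept on the node list."""
    return min_partial * per_slab


per_slab = objs_per_slab(ORDER, OBJECT_SIZE)
for min_partial in (5, 9):
    capacity = partial_capacity(min_partial, per_slab)
    covers = "covers" if capacity >= WORKING_SET else "does not cover"
    print(f"min_partial={min_partial}: {capacity} objects, "
          f"{covers} the {WORKING_SET}-object working set")
```

With the default min_partial=5 the bound is 160 objects, below the 250-object working set; with min_partial=9 it is the 288 objects mentioned earlier in the thread.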

Run01: non-tuned
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.092 ns - 47 cycles(tsc) 11.886 ns
  2 -  57 cycles(tsc) 14.258 ns - 28 cycles(tsc)  7.226 ns
  3 -  54 cycles(tsc) 13.626 ns - 23 cycles(tsc)  5.822 ns
  4 -  53 cycles(tsc) 13.328 ns - 20 cycles(tsc)  5.185 ns
  8 -  93 cycles(tsc) 23.301 ns - 49 cycles(tsc) 12.406 ns
 16 -  83 cycles(tsc) 20.902 ns - 37 cycles(tsc)  9.418 ns
 30 -  77 cycles(tsc) 19.400 ns - 30 cycles(tsc)  7.748 ns
 32 -  79 cycles(tsc) 19.938 ns - 30 cycles(tsc)  7.751 ns
 34 -  80 cycles(tsc) 20.215 ns - 35 cycles(tsc)  8.907 ns
 48 -  85 cycles(tsc) 21.391 ns - 24 cycles(tsc)  6.219 ns
 64 -  93 cycles(tsc) 23.272 ns - 67 cycles(tsc) 16.874 ns
128 - 101 cycles(tsc) 25.407 ns - 72 cycles(tsc) 18.097 ns
158 - 105 cycles(tsc) 26.319 ns - 72 cycles(tsc) 18.164 ns
250 - 107 cycles(tsc) 26.783 ns - 72 cycles(tsc) 18.246 ns
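
The tables quoted earlier in the thread carry an "improved" column that these runs omit. Judging from the quoted figures (for example 64 -> 47 cycles reported as 26.6%), it is the relative cycle saving of the bulk API over the fallback. A small sketch that recomputes it for Run01, with the cycle counts copied from the table above and a hypothetical helper name:

```python
def improvement_pct(fallback_cycles: int, bulk_cycles: int) -> float:
    """Relative saving of the bulk API versus the fallback, in percent."""
    return (fallback_cycles - bulk_cycles) / fallback_cycles * 100.0


# bulk size -> (fallback cycles, bulk API cycles), copied from Run01
run01 = {
    1: (64, 47), 2: (57, 28), 3: (54, 23), 4: (53, 20),
    8: (93, 49), 16: (83, 37), 30: (77, 30), 32: (79, 30),
    34: (80, 35), 48: (85, 24), 64: (93, 67), 128: (101, 72),
    158: (105, 72), 250: (107, 72),
}

improvements = {n: improvement_pct(f, b) for n, (f, b) in run01.items()}
for n, pct in improvements.items():
    print(f"{n:3d} - improved {pct:4.1f}%")
```

The same helper can be applied to the other runs by swapping in their cycle counts.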

Run02: non-tuned
bulk- Fallback                  - Bulk API
  1 -  63 cycles(tsc) 15.864 ns - 46 cycles(tsc) 11.672 ns
  2 -  56 cycles(tsc) 14.153 ns - 28 cycles(tsc)  7.119 ns
  3 -  54 cycles(tsc) 13.681 ns - 23 cycles(tsc)  5.846 ns
  4 -  53 cycles(tsc) 13.354 ns - 20 cycles(tsc)  5.141 ns
  8 -  51 cycles(tsc) 12.970 ns - 19 cycles(tsc)  4.954 ns
 16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.003 ns
 30 -  51 cycles(tsc) 12.760 ns - 20 cycles(tsc)  5.065 ns
 32 -  80 cycles(tsc) 20.045 ns - 37 cycles(tsc)  9.311 ns
 34 -  73 cycles(tsc) 18.454 ns - 27 cycles(tsc)  6.773 ns
 48 -  82 cycles(tsc) 20.544 ns - 35 cycles(tsc)  8.973 ns
 64 -  87 cycles(tsc) 21.809 ns - 60 cycles(tsc) 15.167 ns
128 - 103 cycles(tsc) 25.772 ns - 63 cycles(tsc) 15.874 ns
158 - 104 cycles(tsc) 26.215 ns - 61 cycles(tsc) 15.433 ns
250 - 107 cycles(tsc) 26.926 ns - 60 cycles(tsc) 15.058 ns

Notice the variation is fairly high between runs... :-(
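
To put a rough number on that variation, the sketch below compares the Bulk API cycle counts of Run01 and Run02 per bulk size. The spread metric (absolute difference relative to the faster run) is just one simple choice, not something used elsewhere in this thread, and two runs are far too few for real statistics.

```python
# bulk size -> Bulk API cycles, copied from Run01 and Run02 above
run01_bulk = {1: 47, 2: 28, 3: 23, 4: 20, 8: 49, 16: 37, 30: 30, 32: 30,
              34: 35, 48: 24, 64: 67, 128: 72, 158: 72, 250: 72}
run02_bulk = {1: 46, 2: 28, 3: 23, 4: 20, 8: 19, 16: 20, 30: 20, 32: 37,
              34: 27, 48: 35, 64: 60, 128: 63, 158: 61, 250: 60}


def relative_spread(a: int, b: int) -> float:
    """Absolute difference between two runs, in percent of the faster one."""
    return abs(a - b) / min(a, b) * 100.0


spread = {n: relative_spread(run01_bulk[n], run02_bulk[n]) for n in run01_bulk}
worst = max(spread, key=spread.get)

for n, pct in spread.items():
    print(f"{n:3d} - {run01_bulk[n]:2d} vs {run02_bulk[n]:2d} cycles - spread {pct:5.1f}%")
print(f"largest spread at bulk size {worst}")
```

The small bulk sizes (1 to 4) agree closely between the two runs, while the mid-range sizes differ the most.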

> 3. For more precise test setup, how about setting cpu affinity?

Sure, I've started using this test command:
 sudo taskset -c 1 modprobe slab_bulk_test01 && sudo rmmod slab_bulk_test01 && sudo dmesg

Code:
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

For these runs I've also disabled HT (Hyper-Threading) in the BIOS, as
this turned out to be a big disturbance for my network testing use-case.
(P.S. I've hacked together a use-case in ixgbe/skbuff.c, so far only TX
completion bulk-free, which shows an improvement of 3ns and 16ns with
this slab tuning. Once I also implement alloc-bulk I should get a better
boost.)


> 2. Could you show result with only tuning min_partial?
> - I guess that much improvement for Bulk-API comes from disappearing
>   slab page allocation/free cost rather than tuning cpu_partial.

Sure, here are some more runs:

  sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'

Run03: tuned min_partial=9
bulk- Fallback                  - Bulk API
  1 -  63 cycles(tsc) 15.910 ns - 46 cycles(tsc) 11.720 ns
  2 -  57 cycles(tsc) 14.318 ns - 29 cycles(tsc)  7.266 ns
  3 -  55 cycles(tsc) 13.762 ns - 23 cycles(tsc)  5.937 ns
  4 -  53 cycles(tsc) 13.459 ns - 20 cycles(tsc)  5.211 ns
  8 -  51 cycles(tsc) 13.001 ns - 19 cycles(tsc)  4.821 ns
 16 -  51 cycles(tsc) 12.772 ns - 20 cycles(tsc)  5.016 ns
 30 -  84 cycles(tsc) 21.135 ns - 28 cycles(tsc)  7.047 ns
 32 -  83 cycles(tsc) 20.887 ns - 28 cycles(tsc)  7.133 ns
 34 -  81 cycles(tsc) 20.454 ns - 28 cycles(tsc)  7.024 ns
 48 -  86 cycles(tsc) 21.662 ns - 32 cycles(tsc)  8.121 ns
 64 -  92 cycles(tsc) 23.027 ns - 52 cycles(tsc) 13.033 ns
128 -  97 cycles(tsc) 24.270 ns - 51 cycles(tsc) 12.865 ns
158 - 105 cycles(tsc) 26.290 ns - 53 cycles(tsc) 13.435 ns
250 - 106 cycles(tsc) 26.545 ns - 54 cycles(tsc) 13.607 ns

Run04: tuned min_partial=9
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.123 ns - 47 cycles(tsc) 11.906 ns
  2 -  57 cycles(tsc) 14.267 ns - 28 cycles(tsc)  7.235 ns
  3 -  54 cycles(tsc) 13.691 ns - 23 cycles(tsc)  5.916 ns
  4 -  53 cycles(tsc) 13.470 ns - 21 cycles(tsc)  5.278 ns
  8 -  51 cycles(tsc) 12.991 ns - 19 cycles(tsc)  4.815 ns
 16 -  50 cycles(tsc) 12.651 ns - 19 cycles(tsc)  4.840 ns
 30 -  81 cycles(tsc) 20.282 ns - 35 cycles(tsc)  8.835 ns
 32 -  77 cycles(tsc) 19.327 ns - 29 cycles(tsc)  7.403 ns
 34 -  77 cycles(tsc) 19.438 ns - 31 cycles(tsc)  7.879 ns
 48 -  85 cycles(tsc) 21.367 ns - 34 cycles(tsc)  8.563 ns
 64 -  87 cycles(tsc) 21.830 ns - 55 cycles(tsc) 13.820 ns
128 - 109 cycles(tsc) 27.445 ns - 56 cycles(tsc) 14.152 ns
158 - 102 cycles(tsc) 25.576 ns - 60 cycles(tsc) 15.120 ns
250 - 108 cycles(tsc) 27.069 ns - 58 cycles(tsc) 14.534 ns

Looking at Run04 the win was not so big...

Also adjust cpu_partial:
 sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'

$ sudo grep -H . /sys/kernel/slab/:t-0000256/{cpu_partial,min_partial,order,objs_per_slab}
/sys/kernel/slab/:t-0000256/cpu_partial:256
/sys/kernel/slab/:t-0000256/min_partial:9
/sys/kernel/slab/:t-0000256/order:1
/sys/kernel/slab/:t-0000256/objs_per_slab:32

Run05: also tuned cpu_partial=256
bulk- Fallback                  - Bulk API
  1 -  63 cycles(tsc) 15.867 ns - 46 cycles(tsc) 11.656 ns
  2 -  56 cycles(tsc) 14.229 ns - 28 cycles(tsc)  7.131 ns
  3 -  54 cycles(tsc) 13.587 ns - 23 cycles(tsc)  5.760 ns
  4 -  53 cycles(tsc) 13.287 ns - 20 cycles(tsc)  5.081 ns
  8 -  51 cycles(tsc) 12.935 ns - 19 cycles(tsc)  4.953 ns
 16 -  50 cycles(tsc) 12.707 ns - 20 cycles(tsc)  5.074 ns
 30 -  79 cycles(tsc) 19.927 ns - 28 cycles(tsc)  7.057 ns
 32 -  79 cycles(tsc) 19.977 ns - 31 cycles(tsc)  7.762 ns
 34 -  79 cycles(tsc) 19.800 ns - 33 cycles(tsc)  8.392 ns
 48 -  93 cycles(tsc) 23.316 ns - 35 cycles(tsc)  8.777 ns
 64 -  92 cycles(tsc) 23.144 ns - 33 cycles(tsc)  8.449 ns
128 -  97 cycles(tsc) 24.268 ns - 35 cycles(tsc)  8.943 ns
158 - 106 cycles(tsc) 26.606 ns - 40 cycles(tsc) 10.067 ns
250 - 109 cycles(tsc) 27.385 ns - 51 cycles(tsc) 12.957 ns

Run06: also tuned cpu_partial=256
bulk- Fallback                  - Bulk API
  1 -  63 cycles(tsc) 15.952 ns - 46 cycles(tsc) 11.710 ns
  2 -  57 cycles(tsc) 14.309 ns - 29 cycles(tsc)  7.261 ns
  3 -  54 cycles(tsc) 13.703 ns - 23 cycles(tsc)  5.858 ns
  4 -  53 cycles(tsc) 13.394 ns - 20 cycles(tsc)  5.161 ns
  8 -  52 cycles(tsc) 13.013 ns - 19 cycles(tsc)  4.809 ns
 16 -  94 cycles(tsc) 23.734 ns - 49 cycles(tsc) 12.376 ns
 30 -  88 cycles(tsc) 22.221 ns - 35 cycles(tsc)  8.933 ns
 32 - 101 cycles(tsc) 25.319 ns - 41 cycles(tsc) 10.437 ns
 34 -  98 cycles(tsc) 24.711 ns - 41 cycles(tsc) 10.485 ns
 48 -  96 cycles(tsc) 24.119 ns - 41 cycles(tsc) 10.479 ns
 64 - 100 cycles(tsc) 25.223 ns - 39 cycles(tsc)  9.766 ns
128 - 100 cycles(tsc) 25.078 ns - 34 cycles(tsc)  8.602 ns
158 - 102 cycles(tsc) 25.673 ns - 38 cycles(tsc)  9.645 ns
250 - 110 cycles(tsc) 27.560 ns - 40 cycles(tsc) 10.046 ns
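
To make the three configurations easier to compare for the large bulk sizes (64, 128, 158, 250), here is a rough summary sketch. It averages the Bulk API cycle counts from Run01 to Run06 per configuration. With only two runs each and the variation noted above, the means are indicative at best; the grouping and the labels are my own.

```python
from statistics import mean

# Bulk API cycles for bulk sizes 64, 128, 158, 250, copied from the runs above
bulk_cycles = {
    "non-tuned": {"Run01": (67, 72, 72, 72), "Run02": (60, 63, 61, 60)},
    "min_partial=9": {"Run03": (52, 51, 53, 54), "Run04": (55, 56, 60, 58)},
    "min_partial=9 + cpu_partial=256": {"Run05": (33, 35, 40, 51),
                                        "Run06": (39, 34, 38, 40)},
}

# Mean over all large bulk sizes and both runs of each configuration
summary = {
    config: mean(c for cycles in runs.values() for c in cycles)
    for config, runs in bulk_cycles.items()
}

for config, avg in summary.items():
    print(f"{config:32s} {avg:6.2f} cycles (mean, bulk sizes >= 64)")
```

On these numbers, tuning min_partial alone lowers the mean, and additionally raising cpu_partial lowers it further.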

(p.s. I'm currently on vacation for 3 weeks...)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

