netdev.vger.kernel.org archive mirror
* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)
       [not found]           ` <CA+55aFzBTL=DnC4zv6yxjk0HxwxWpOhpKDPA8zkTGdgbh08sEg@mail.gmail.com>
@ 2015-09-07  9:30             ` Jesper Dangaard Brouer
  2015-09-07 20:22               ` Linus Torvalds
  0 siblings, 1 reply; 3+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brouer, Dave Chinner, Mike Snitzer, Christoph Lameter,
	Pekka Enberg, Andrew Morton, David Rientjes, Joonsoo Kim,
	dm-devel@redhat.com, Alasdair G Kergon, Joe Thornber,
	Mikulas Patocka, Vivek Goyal, Sami Tolvanen, Viresh Kumar,
	Heinz Mauelshagen, linux-mm, netdev@vger.kernel.org


On Thu, 3 Sep 2015 20:51:09 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Sep 3, 2015 at 8:26 PM, Dave Chinner <dchinner@redhat.com> wrote:
> >
> > The double standard is the problem here. No notification, proof,
> > discussion or review was needed to turn on slab merging for
> > everyone, but you're setting a very high bar to jump if anyone wants
> > to turn it off in their code.
> 
> Ehh. You realize that almost the only load that is actually seriously
> allocator-limited is networking?
> 
> And slub was beating slab on that? And slub has been doing the merging
> since day one. Slab was just changed to try to keep up with the
> winning strategy.

Sorry, I have to correct you on this.  The slub allocator is not as
fast as you might think.  The slab allocator is actually faster for
networking.

IP-forwarding, single CPU, single flow UDP (highly tuned):
 * Allocator slub: 2043575 pps
 * Allocator slab: 2088295 pps

Difference (slab faster than slub):
 * +44720 pps, i.e. 10.48 ns less per packet

The slub allocator has a faster "fastpath" if your workload quickly
reuses objects within the same per-cpu page-slab, but once the
workload grows you hit the slowpath, and slab catches up. Slub looks
great in micro-benchmarks.


As you can see in patchset:
 [PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
 http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=376625

I'm working on speeding up slub to the level of slab, and it seems
I have succeeded by about half a nanosecond: 2090522 pps (+2227 pps,
or 0.51 ns, over slab).

And with "slab_nomerge" I get even higher performance:
 * slub: bulk-free and slab_nomerge: 2121824 pps
 * Diff to baseline slub: +78249 pps, i.e. 18.05 ns less per packet

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)
  2015-09-07  9:30             ` slab-nomerge (was Re: [git pull] device mapper changes for 4.3) Jesper Dangaard Brouer
@ 2015-09-07 20:22               ` Linus Torvalds
  2015-09-07 21:17                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 3+ messages in thread
From: Linus Torvalds @ 2015-09-07 20:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
	Andrew Morton, David Rientjes, Joonsoo Kim, dm-devel@redhat.com,
	Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
	Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
	netdev@vger.kernel.org

On Mon, Sep 7, 2015 at 2:30 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The slub allocator has a faster "fastpath" if your workload quickly
> reuses objects within the same per-cpu page-slab, but once the
> workload grows you hit the slowpath, and slab catches up. Slub looks
> great in micro-benchmarks.
>
> And with "slab_nomerge" I get even higher performance:

I think those two are related.

Not merging means that effectively the percpu caches end up being
bigger (simply because there are more of them), and so it captures
more of the fastpath cases.

Obviously the percpu queue size is an easy tunable, but there are
real downsides to that too. I suspect your IP forwarding case isn't so
different from some of the microbenchmarks, it just has more
outstanding work..

And yes, the slow path (ie not hitting in the percpu cache) of SLUB
could hopefully be optimizable too, although maybe the bulk patches
are the way to go (and unrelated to this thread - at least part of
your bulk patches actually got merged last Friday - they were part of
Andrew's patch-bomb).

            Linus


* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)
  2015-09-07 20:22               ` Linus Torvalds
@ 2015-09-07 21:17                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 3+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
	Andrew Morton, David Rientjes, Joonsoo Kim, dm-devel@redhat.com,
	Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
	Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
	netdev@vger.kernel.org, brouer

On Mon, 7 Sep 2015 13:22:13 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 7, 2015 at 2:30 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > The slub allocator has a faster "fastpath" if your workload quickly
> > reuses objects within the same per-cpu page-slab, but once the
> > workload grows you hit the slowpath, and slab catches up. Slub looks
> > great in micro-benchmarks.
> >
> > And with "slab_nomerge" I get even higher performance:
> 
> I think those two are related.
> 
> Not merging means that effectively the percpu caches end up being
> bigger (simply because there are more of them), and so it captures
> more of the fastpath cases.

Yes, that was also my theory: manually tuning the percpu sizes gave
me almost the same boost.


> Obviously the percpu queue size is an easy tunable, but there are
> real downsides to that too. 

The easy fix is to introduce a subsystem-specific percpu cache that is
large enough for our use-case.  That seems to be the trend.  I'm hoping
to come up with something smarter that every subsystem can benefit from,
e.g. some heuristic that dynamically adjusts SLUB according to the usage
pattern.  I can imagine something as simple as a counter of slowpath
calls that is only valid while the jiffies count matches (otherwise
reset it to zero and store the new jiffies count).  (But I have not
thought this through...)


> I suspect your IP forwarding case isn't so
> different from some of the microbenchmarks, it just has more
> outstanding work..

Yes, I will admit that my testing is very close to micro-benchmarking,
and it is specifically designed to pressure the system to its limits[1].
Especially the minimum frame size is evil and unrealistic, but the real
purpose is preparing the stack for increasing speeds like 100Gbit/s.


> And yes, the slow path (ie not hitting in the percpu cache) of SLUB
> could hopefully be optimizable too, although maybe the bulk patches
> are the way to go (and unrelated to this thread - at least part of
> your bulk patches actually got merged last Friday - they were part of
> Andrew's patch-bomb).

Cool. Yes, it is only part of the bulk patches.  The real performance
boosters are not in yet (I need to make them work correctly with
memory debugging enabled before they can get merged).  At least the
main API is in, which allows me to implement the use-cases more easily
in other subsystems :-)

[1] http://netoptimizer.blogspot.dk/2014/09/packet-per-sec-measurements-for.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


