* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)

From: Jesper Dangaard Brouer @ 2015-09-07  9:30 UTC
To: Linus Torvalds
Cc: brouer, Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
    Andrew Morton, David Rientjes, Joonsoo Kim, dm-devel@redhat.com,
    Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
    Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
    netdev@vger.kernel.org

On Thu, 3 Sep 2015 20:51:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Sep 3, 2015 at 8:26 PM, Dave Chinner <dchinner@redhat.com> wrote:
> >
> > The double standard is the problem here. No notification, proof,
> > discussion or review was needed to turn on slab merging for
> > everyone, but you're setting a very high bar to jump if anyone wants
> > to turn it off in their code.
>
> Ehh. You realize that almost the only load that is actually seriously
> allocator-limited is networking?
>
> And slub was beating slab on that? And slub has been doing the merging
> since day one. Slab was just changed to try to keep up with the
> winning strategy.

Sorry, I have to correct you on this. The slub allocator is not as fast
as you might think. For networking, the slab allocator is actually
faster.

IP-forwarding, single CPU, single flow UDP (highly tuned):
 * Allocator slub: 2043575 pps
 * Allocator slab: 2088295 pps

Difference, slab faster than slub:
 * +44720 pps and -10.48 ns

The slub allocator has a faster "fastpath" if your workload is
fast-reusing within the same per-cpu page-slab, but once the workload
increases you hit the slowpath, and then slab catches up. Slub looks
great in micro-benchmarking.

As you can see in the patchset:
 [PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
 http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=376625

I'm working on speeding up slub to the level of slab, and it seems I
have succeeded, by about half a nanosecond: 2090522 pps (+2227 pps or
0.51 ns over slab).

And with "slab_nomerge" I get even higher performance:
 * slub, bulk-free and slab_nomerge: 2121824 pps
 * Diff to slub: +78249 pps and -18.05 ns

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
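For reference, the nanosecond deltas quoted above follow directly from
inverting the packets-per-second figures. The small stand-alone program
below is only an illustration of that arithmetic using the numbers
reported in this mail; it is not part of any patchset in this thread.

#include <stdio.h>

/* ns-per-packet is the inverse of packets-per-second; the values below
 * are the ones reported above, not a new measurement.
 */
static double ns_per_pkt(double pps)
{
	return 1e9 / pps;
}

int main(void)
{
	double slub = 2043575.0, slab = 2088295.0;

	printf("slub: %.2f ns/pkt, slab: %.2f ns/pkt, diff: %.2f ns\n",
	       ns_per_pkt(slub), ns_per_pkt(slab),
	       ns_per_pkt(slub) - ns_per_pkt(slab));
	/* prints: slub: 489.34 ns/pkt, slab: 478.86 ns/pkt, diff: 10.48 ns */
	return 0;
}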
* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)

From: Linus Torvalds @ 2015-09-07 20:22 UTC
To: Jesper Dangaard Brouer
Cc: Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
    Andrew Morton, David Rientjes, Joonsoo Kim, dm-devel@redhat.com,
    Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
    Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
    netdev@vger.kernel.org

On Mon, Sep 7, 2015 at 2:30 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The slub allocator has a faster "fastpath" if your workload is
> fast-reusing within the same per-cpu page-slab, but once the workload
> increases you hit the slowpath, and then slab catches up. Slub looks
> great in micro-benchmarking.
>
> And with "slab_nomerge" I get even higher performance:

I think those two are related.

Not merging means that effectively the percpu caches end up being
bigger (simply because there are more of them), and so it captures
more of the fastpath cases.

Obviously the percpu queue size is an easy tunable too, but there are
real downsides to that as well. I suspect your IP forwarding case isn't
so different from some of the microbenchmarks, it just has more
outstanding work.

And yes, the slow path (ie not hitting in the percpu cache) of SLUB
could hopefully be optimizable too, although maybe the bulk patches
are the way to go (and unrelated to this thread - at least part of
your bulk patches actually got merged last Friday - they were part of
Andrew's patch-bomb).

               Linus
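The fastpath/slowpath split both mails refer to can be pictured with a
toy user-space model of a fixed-size per-CPU freelist. This is not SLUB
code; every name and the size 16 are invented for illustration. A larger
freelist lets more alloc/free pairs complete without ever reaching the
shared slow path, which is the effect an effectively bigger percpu cache
has.

#include <stdio.h>
#include <stdlib.h>

#define FREELIST_SIZE 16   /* stand-in for the "percpu queue size" knob */

struct percpu_cache {
	void *objs[FREELIST_SIZE];
	int nr;                     /* objects currently cached */
	unsigned long slowpath;     /* how often we missed the cache */
};

static void *cache_alloc(struct percpu_cache *c)
{
	if (c->nr)                  /* fast path: pop from the percpu freelist */
		return c->objs[--c->nr];
	c->slowpath++;              /* slow path: fall back to the shared pool */
	return malloc(64);
}

static void cache_free(struct percpu_cache *c, void *obj)
{
	if (c->nr < FREELIST_SIZE) {  /* fast path: push back onto the freelist */
		c->objs[c->nr++] = obj;
		return;
	}
	c->slowpath++;                /* slow path: return to the shared pool */
	free(obj);
}

int main(void)
{
	struct percpu_cache c = { .nr = 0, .slowpath = 0 };
	void *bulk[64];
	int i;

	/* A burst larger than the freelist forces slow-path work; a burst
	 * that fits in FREELIST_SIZE would stay entirely in the fast path.
	 */
	for (i = 0; i < 64; i++)
		bulk[i] = cache_alloc(&c);
	for (i = 0; i < 64; i++)
		cache_free(&c, bulk[i]);

	printf("slow-path hits: %lu\n", c.slowpath);  /* 64 allocs + 48 frees = 112 */
	return 0;
}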
* Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)

From: Jesper Dangaard Brouer @ 2015-09-07 21:17 UTC
To: Linus Torvalds
Cc: Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
    Andrew Morton, David Rientjes, Joonsoo Kim, dm-devel@redhat.com,
    Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
    Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
    netdev@vger.kernel.org, brouer

On Mon, 7 Sep 2015 13:22:13 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 7, 2015 at 2:30 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > The slub allocator has a faster "fastpath" if your workload is
> > fast-reusing within the same per-cpu page-slab, but once the workload
> > increases you hit the slowpath, and then slab catches up. Slub looks
> > great in micro-benchmarking.
> >
> > And with "slab_nomerge" I get even higher performance:
>
> I think those two are related.
>
> Not merging means that effectively the percpu caches end up being
> bigger (simply because there are more of them), and so it captures
> more of the fastpath cases.

Yes, that was also my theory. Manually tuning the percpu sizes gave me
almost the same boost.

> Obviously the percpu queue size is an easy tunable too, but there are
> real downsides to that as well.

The easy fix is to introduce a subsystem-specific percpu cache that is
large enough for our use-case. That seems to be a trend.

I'm hoping to come up with something smarter that every subsystem can
benefit from, e.g. some heuristic that can dynamically adjust SLUB
according to the usage pattern. I can imagine something as simple as a
counter for every slowpath call that is only valid as long as the
jiffies count matches (otherwise reset it to zero and store the new
jiffies count). (But I have not thought this through...)

> I suspect your IP forwarding case isn't so
> different from some of the microbenchmarks, it just has more
> outstanding work.

Yes, I will admit that my testing is very close to micro-benchmarking,
and it is specifically designed to pressure the system to its
limits[1]. Especially the minimum frame size is evil and unrealistic,
but the real purpose is preparing the stack for increasing speeds like
100Gbit/s.

> And yes, the slow path (ie not hitting in the percpu cache) of SLUB
> could hopefully be optimizable too, although maybe the bulk patches
> are the way to go (and unrelated to this thread - at least part of
> your bulk patches actually got merged last Friday - they were part of
> Andrew's patch-bomb).

Cool. Yes, it is only part of the bulk patches. The real performance
boosters are not in yet (but I need to make them work correctly with
memory debugging enabled before they can get merged). At least the main
API is in, which allows me to implement the use-cases more easily in
other subsystems :-)

[1] http://netoptimizer.blogspot.dk/2014/09/packet-per-sec-measurements-for.html

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
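A rough sketch of the jiffies-scoped counter idea described above might
look like the following. Every structure, field, function and threshold
here is hypothetical; nothing like it exists in SLUB, and the sketch is
only meant to make the "counter valid within one jiffy" notion concrete.

#include <linux/jiffies.h>
#include <linux/types.h>

#define SLOWPATH_BURST_THRESHOLD 32	/* made-up value, would need tuning */

struct slub_slowpath_stats {		/* imagined per-cpu bookkeeping */
	unsigned long last_jiffies;
	unsigned int  count;
};

/* Would be called from the allocation/free slow path.  Returns true when
 * the slow path has been hit "too often" within the current jiffy, i.e.
 * when it might be worth growing the per-cpu cache of this kmem_cache.
 */
static bool slowpath_burst_detected(struct slub_slowpath_stats *s)
{
	if (s->last_jiffies != jiffies) {	/* new tick: counter is stale */
		s->last_jiffies = jiffies;
		s->count = 0;
	}
	return ++s->count > SLOWPATH_BURST_THRESHOLD;
}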
Thread overview: 3+ messages (end of thread, newest: 2015-09-07 21:17 UTC)
  2015-09-07  9:30 ` slab-nomerge (was Re: [git pull] device mapper changes for 4.3) Jesper Dangaard Brouer
  2015-09-07 20:22   ` Linus Torvalds
  2015-09-07 21:17     ` Jesper Dangaard Brouer