* Re: [PATCH 0/1] mm: Remove the SLAB allocator
       [not found]     ` <20190412112816.GD18914@techsingularity.net>
@ 2019-04-17  3:52       ` Andrew Morton
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Morton @ 2019-04-17  3:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Vlastimil Babka, Tobin C. Harding,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Tejun Heo, Qian Cai,
	Linus Torvalds, linux-mm, linux-kernel, netdev,
	Jesper Dangaard Brouer, David Miller

On Fri, 12 Apr 2019 12:28:16 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Apr 10, 2019 at 02:53:34PM -0700, David Rientjes wrote:
> > > FWIW, our enterprise kernels use it (latest is 4.12 based), and openSUSE
> > > kernels as well (with openSUSE Tumbleweed that includes latest
> > > kernel.org stables). AFAIK we don't enable SLAB_DEBUG even in general
> > > debug kernel flavours as it's just too slow.
> > > 
> > > IIRC last time Mel evaluated switching to SLUB, it wasn't a clear
> > > winner, but I'll just CC him for details :)
> > > 
> > 
> > We also use CONFIG_SLAB and disable CONFIG_SLAB_DEBUG for the same reason.
> 
> Would it be possible to re-evaluate using mainline kernel 5.0?

I have vague memories that slab outperforms slub for some networking
loads.  Could the net folks please comment?



* Re: [PATCH 0/1] mm: Remove the SLAB allocator
       [not found]       ` <262df687-c934-b3e2-1d5f-548e8a8acb74@iki.fi>
@ 2019-04-17  8:50         ` Jesper Dangaard Brouer
  2019-04-17 13:27           ` Christopher Lameter
  2019-04-17 13:38           ` Michal Hocko
  0 siblings, 2 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-17  8:50 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Michal Hocko, Tobin C. Harding, Vlastimil Babka, Tobin C. Harding,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Tejun Heo, Qian Cai, Linus Torvalds, linux-mm,
	linux-kernel, Mel Gorman, netdev@vger.kernel.org, Alexander Duyck

On Thu, 11 Apr 2019 11:27:26 +0300
Pekka Enberg <penberg@iki.fi> wrote:

> Hi,
> 
> On 4/11/19 10:55 AM, Michal Hocko wrote:
> > Please, please make it more rigorous than what happened when SLUB was
> > forced to become the default
> 
> This is the hard part.
> 
> Even if you are able to show that SLUB is as fast as SLAB for all the 
> benchmarks you run, there's bound to be that one workload where SLUB 
> regresses. You will then have people complaining about that (rightly so) 
> and you're again stuck with two allocators.
> 
> To move forward, I think we should look at possible *pathological* cases 
> where we think SLAB might have an advantage. For example, SLUB had much 
> more difficulty with remote CPU frees than SLAB. Now I don't know if 
> this is still the case, but it should be easy to construct a synthetic 
> benchmark to measure this.

I do think SLUB has a number of pathological cases where SLAB is
faster.  It was significantly more difficult to get good bulk-free
performance for SLUB.  SLUB is only fast as long as objects belong to
the same page.  To get good bulk-free performance when objects are
"mixed", I coded this[1] way-too-complex fast-path code to counteract
this (joint work with Alex Duyck).

[1] https://github.com/torvalds/linux/blob/v5.1-rc5/mm/slub.c#L3033-L3113
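
To give a feel for the idea, here is a simplified sketch only -- the
real build_detached_freelist()/slab_free() code in mm/slub.c must
additionally handle per-CPU freelists, kfree()'s cache lookup, and it
bounds the lookahead scan:

  /* Sketch: bulk free that groups objects by slab page, so each page
   * is locked/updated once per batch instead of once per object.
   * Mirrors the real struct detached_freelist; uses mm/slub.c
   * internals, so this is not standalone code. */
  struct detached_freelist {
          struct page *page;   /* slab page all objects below belong to */
          void *freelist;      /* head of the object list being built */
          void *tail;          /* last object; page freelist spliced after it */
          int cnt;             /* number of objects on the list */
  };

  static void bulk_free_sketch(struct kmem_cache *s, size_t size, void **p)
  {
          while (size) {
                  struct detached_freelist df = { };
                  void *object;
                  size_t i;

                  /* Find the last not-yet-freed object in the array. */
                  do {
                          object = p[--size];
                  } while (!object && size);
                  if (!object)
                          break;

                  /* Seed the detached freelist with it. */
                  df.page = virt_to_head_page(object);
                  df.tail = df.freelist = object;
                  df.cnt = 1;

                  /* Lookahead: link in earlier objects from the same
                   * page.  (The real code bounds this scan to cap the
                   * worst-case cost.) */
                  for (i = 0; i < size; i++) {
                          if (!p[i] || virt_to_head_page(p[i]) != df.page)
                                  continue;
                          set_freepointer(s, p[i], df.freelist);
                          df.freelist = p[i];
                          df.cnt++;
                          p[i] = NULL;    /* consumed */
                  }

                  /* One call returns the whole list to its slab page.
                   * (The real code goes through slab_free(), which
                   * first tries the per-CPU fastpath.) */
                  __slab_free(s, df.page, df.freelist, df.tail, df.cnt,
                              _RET_IP_);
          }
  }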


> For example, have a userspace process that does networking, which is 
> often memory allocation intensive, so that we know that SKBs traverse 
> between CPUs. You can do this by making sure that the NIC queues are 
> mapped to CPU N (so that network softirqs have to run on that CPU) but 
> the process is pinned to CPU M.

If someone wants to test this with SKBs, be aware that we netdev guys
have a number of optimizations where we try to counteract this. (At a
minimum, disable TSO and GRO.)
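
A setup along these lines should provoke the remote-free pattern (a
sketch; "eth0" and IRQ 24 are placeholders -- find the real IRQ in
/proc/interrupts, and stop irqbalance so it does not rewrite the mask):

  ethtool -K eth0 tso off gso off gro off  # no aggregation hiding per-SKB cost
  ethtool -L eth0 combined 1               # single queue, if the NIC supports it
  systemctl stop irqbalance
  echo 1 > /proc/irq/24/smp_affinity       # NIC IRQ + softirq on CPU 0
  taskset -c 1 netserver                   # netperf server pinned to CPU 1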

It might also be possible for people to get inspired by, and adapt,
the micro-benchmarking[2] kernel modules that I wrote when developing
the SLUB and SLAB optimizations:

[2] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
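
(These are out-of-tree modules; usage is roughly as below -- assuming
the repo's layout, and that, like my other benchmark modules, they run
their measurements at load time and report via the kernel log:)

  cd prototype-kernel/kernel && make     # builds against the running kernel
  sudo insmod mm/slab_bulk_test01.ko     # benchmark runs on module load
  dmesg | tail -40                       # results land in the kernel log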


> It's, of course, worth thinking about other pathological cases too. 
> Workloads that cause large allocations is one. Workloads that cause lots 
> of slab cache shrinking is another.

I also worry about long uptimes where SLUB objects/pages get too
fragmented... as I said, SLUB is only efficient when objects are
returned to the same page, while SLAB does not have this limitation.


I did a comparison of bulk FREE performance here (where SLAB is
slightly faster):
 Commit ca257195511d ("mm: new API kfree_bulk() for SLAB+SLUB allocators")
 [3] https://git.kernel.org/torvalds/c/ca257195511d

You might also notice how simple the SLAB code is:
  Commit e6cdb58d1c83 ("slab: implement bulk free in SLAB allocator")
  [4] https://git.kernel.org/torvalds/c/e6cdb58d1c83


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 0/1] mm: Remove the SLAB allocator
  2019-04-17  8:50         ` Jesper Dangaard Brouer
@ 2019-04-17 13:27           ` Christopher Lameter
  2019-04-17 13:38           ` Michal Hocko
  1 sibling, 0 replies; 5+ messages in thread
From: Christopher Lameter @ 2019-04-17 13:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pekka Enberg, Michal Hocko, Tobin C. Harding, Vlastimil Babka,
	Tobin C. Harding, Andrew Morton, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Tejun Heo, Qian Cai, Linus Torvalds, linux-mm,
	linux-kernel, Mel Gorman, netdev@vger.kernel.org, Alexander Duyck

On Wed, 17 Apr 2019, Jesper Dangaard Brouer wrote:

> I do think SLUB has a number of pathological cases where SLAB is
> faster.  It was significantly more difficult to get good bulk-free
> performance for SLUB.  SLUB is only fast as long as objects belong to
> the same page.  To get good bulk-free performance when objects are
> "mixed", I coded this[1] way-too-complex fast-path code to counteract
> this (joint work with Alex Duyck).

Right. SLUB usually compensates for that with superior allocation
performance.

> > It's, of course, worth thinking about other pathological cases too.
> > Workloads that cause large allocations is one. Workloads that cause lots
> > of slab cache shrinking is another.
>
> I also worry about long uptimes where SLUB objects/pages get too
> fragmented... as I said, SLUB is only efficient when objects are
> returned to the same page, while SLAB does not have this limitation.

??? Why would SLUB pages get more fragmented? SLUB has fragmentation
prevention methods that SLAB does not have.


* Re: [PATCH 0/1] mm: Remove the SLAB allocator
  2019-04-17  8:50         ` Jesper Dangaard Brouer
  2019-04-17 13:27           ` Christopher Lameter
@ 2019-04-17 13:38           ` Michal Hocko
  2019-04-22 14:43             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2019-04-17 13:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pekka Enberg, Tobin C. Harding, Vlastimil Babka, Tobin C. Harding,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Tejun Heo, Qian Cai, Linus Torvalds, linux-mm,
	linux-kernel, Mel Gorman, netdev@vger.kernel.org, Alexander Duyck

On Wed 17-04-19 10:50:18, Jesper Dangaard Brouer wrote:
> On Thu, 11 Apr 2019 11:27:26 +0300
> Pekka Enberg <penberg@iki.fi> wrote:
> 
> > Hi,
> > 
> > On 4/11/19 10:55 AM, Michal Hocko wrote:
> > > Please, please make it more rigorous than what happened when SLUB was
> > > forced to become the default
> > 
> > This is the hard part.
> > 
> > Even if you are able to show that SLUB is as fast as SLAB for all the 
> > benchmarks you run, there's bound to be that one workload where SLUB 
> > regresses. You will then have people complaining about that (rightly so) 
> > and you're again stuck with two allocators.
> > 
> > To move forward, I think we should look at possible *pathological* cases 
> > where we think SLAB might have an advantage. For example, SLUB had much 
> > more difficulty with remote CPU frees than SLAB. Now I don't know if 
> > this is still the case, but it should be easy to construct a synthetic 
> > benchmark to measure this.
> 
> I do think SLUB has a number of pathological cases where SLAB is
> faster.  It was significantly more difficult to get good bulk-free
> performance for SLUB.  SLUB is only fast as long as objects belong to
> the same page.  To get good bulk-free performance when objects are
> "mixed", I coded this[1] way-too-complex fast-path code to counteract
> this (joint work with Alex Duyck).
> 
> [1] https://github.com/torvalds/linux/blob/v5.1-rc5/mm/slub.c#L3033-L3113

How often is this a real problem for real workloads?

> > For example, have a userspace process that does networking, which is 
> > often memory allocation intensive, so that we know that SKBs traverse 
> > between CPUs. You can do this by making sure that the NIC queues are 
> > mapped to CPU N (so that network softirqs have to run on that CPU) but 
> > the process is pinned to CPU M.
> 
> If someone wants to test this with SKBs, be aware that we netdev guys
> have a number of optimizations where we try to counteract this. (At a
> minimum, disable TSO and GRO.)
> 
> It might also be possible for people to get inspired by, and adapt,
> the micro-benchmarking[2] kernel modules that I wrote when developing
> the SLUB and SLAB optimizations:
> 
> [2] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm

While microbenchmarks are good at exposing pathological behavior, I
would be really interested to see some numbers for real-world use cases.
 
> > It's, of course, worth thinking about other pathological cases too. 
> > Workloads that cause large allocations is one. Workloads that cause lots 
> > of slab cache shrinking is another.
> 
> I also worry about long uptimes where SLUB objects/pages get too
> fragmented... as I said, SLUB is only efficient when objects are
> returned to the same page, while SLAB does not have this limitation.

Is this something that has been actually measured in a real deployment?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/1] mm: Remove the SLAB allocator
  2019-04-17 13:38           ` Michal Hocko
@ 2019-04-22 14:43             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-22 14:43 UTC (permalink / raw)
  To: Michal Hocko, Brendan Gregg
  Cc: brouer, Pekka Enberg, Tobin C. Harding, Vlastimil Babka,
	Tobin C. Harding, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Tejun Heo, Qian Cai, Linus Torvalds,
	linux-mm, linux-kernel, Mel Gorman, netdev@vger.kernel.org,
	Alexander Duyck

On Wed, 17 Apr 2019 15:38:52 +0200
Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 17-04-19 10:50:18, Jesper Dangaard Brouer wrote:
> > On Thu, 11 Apr 2019 11:27:26 +0300
> > Pekka Enberg <penberg@iki.fi> wrote:
> >   
> > > Hi,
> > > 
> > > On 4/11/19 10:55 AM, Michal Hocko wrote:  
> > > > Please, please make it more rigorous than what happened when SLUB was
> > > > forced to become the default
> > > 
> > > This is the hard part.
> > > 
> > > Even if you are able to show that SLUB is as fast as SLAB for all the 
> > > benchmarks you run, there's bound to be that one workload where SLUB 
> > > regresses. You will then have people complaining about that (rightly so) 
> > > and you're again stuck with two allocators.
> > > 
> > > To move forward, I think we should look at possible *pathological* cases 
> > > where we think SLAB might have an advantage. For example, SLUB had much 
> > > more difficulty with remote CPU frees than SLAB. Now I don't know if 
> > > this is still the case, but it should be easy to construct a synthetic 
> > > benchmark to measure this.  
> > 
> > I do think SLUB has a number of pathological cases where SLAB is
> > faster.  It was significantly more difficult to get good bulk-free
> > performance for SLUB.  SLUB is only fast as long as objects belong to
> > the same page.  To get good bulk-free performance when objects are
> > "mixed", I coded this[1] way-too-complex fast-path code to counteract
> > this (joint work with Alex Duyck).
> > 
> > [1] https://github.com/torvalds/linux/blob/v5.1-rc5/mm/slub.c#L3033-L3113  
> 
> How often is this a real problem for real workloads?

First let me point out that I have a benchmark[2] that tests this
worst-case behavior, and micro-benchmark wise it was a big win.  I did
limit the "lookahead" based on this benchmark, to balance/bound the
worst-case behavior.

 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c#L4-L8

Second, I do think this happens for real workloads.  Production
systems will have many sockets where SKBs (slab objects) can be
queued, and an unpredictable traffic pattern, which can cause this
"mixing" of slab objects from different pages.  The skbuff_head_cache
object size is 256 bytes and it uses an order-1 page, giving
(8192/256 =) 32 objects per page.  Netstack bulk free mostly happens
from (DMA) TX completion, which has ring sizes usually between 512 and
1024 packets, although we do limit bulk free to 64 objects.
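
Putting numbers on that (a back-of-envelope calculation, runnable as
plain userspace C):

  #include <stdio.h>

  int main(void)
  {
          int obj_size  = 256;    /* skbuff_head_cache object size   */
          int page_size = 8192;   /* order-1 page                    */
          int bulk      = 64;     /* netstack bulk-free batch limit  */
          int objs_per_page = page_size / obj_size;   /* = 32        */

          /* Best case all 64 objects share slab pages; worst case
           * every object sits on its own page, so the detached-
           * freelist loop runs 64 times instead of 2. */
          printf("objects per page: %d\n", objs_per_page);
          printf("pages touched per bulk free: best %d, worst %d\n",
                 bulk / objs_per_page, bulk);
          return 0;
  }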


> > > For example, have a userspace process that does networking, which is 
> > > often memory allocation intensive, so that we know that SKBs traverse 
> > > between CPUs. You can do this by making sure that the NIC queues are 
> > > mapped to CPU N (so that network softirqs have to run on that CPU) but 
> > > the process is pinned to CPU M.  
> > 
> > If someone wants to test this with SKBs, be aware that we netdev guys
> > have a number of optimizations where we try to counteract this. (At a
> > minimum, disable TSO and GRO.)
> > 
> > It might also be possible for people to get inspired by, and adapt,
> > the micro-benchmarking[2] kernel modules that I wrote when developing
> > the SLUB and SLAB optimizations:
> > 
> > [2] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm  
> 
> While microbenchmarks are good at exposing pathological behavior, I
> would be really interested to see some numbers for real-world use cases.

Yes, I would love to see that too, but there is a gap between kernel
developers with the knowledge to diagnose/make sense of this, and
people running production systems...

(Cc Brendan Gregg)
Maybe we should create some tracepoints that make it possible to
measure, e.g. how often the SLUB fast-path vs slow-path is hit (or
other behavior _you_ want to know about), and then create some
easy-to-use trace tools that sysadmins can run.  I bet Brendan could
write a bpftrace[3] script that does this, if someone can describe
what we want to measure...

[3] https://github.com/iovisor/bpftrace
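
Until such tracepoints exist, something in this direction might work
as a first approximation (a sketch; it assumes a SLUB kernel where the
internal slow-path functions ___slab_alloc and __slab_free are visible
in kallsyms and not inlined -- they are static, so this can vary with
kernel version and config):

  bpftrace -e '
  tracepoint:kmem:kmem_cache_alloc { @alloc_total = count(); }
  kprobe:___slab_alloc             { @alloc_slowpath = count(); }
  tracepoint:kmem:kmem_cache_free  { @free_total = count(); }
  kprobe:__slab_free               { @free_slowpath = count(); }
  '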

 
> > > It's, of course, worth thinking about other pathological cases too. 
> > > Workloads that cause large allocations is one. Workloads that cause lots 
> > > of slab cache shrinking is another.  
> > 
> > I also worry about long uptimes where SLUB objects/pages get too
> > fragmented... as I said, SLUB is only efficient when objects are
> > returned to the same page, while SLAB does not have this limitation.
> 
> Is this something that has been actually measured in a real deployment?

This is also something it would be interesting to have a tool for,
one that can answer: how fragmented are the SLUB objects in my
production system?
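
A rough proxy is available today via SLUB's sysfs stats (instantaneous
values only, and file availability depends on kernel config; the
slabinfo tool in tools/vm/slabinfo.c reports similar data):

  cd /sys/kernel/slab/skbuff_head_cache
  echo "objects:       $(cat objects)"         # objects currently in use
  echo "total_objects: $(cat total_objects)"   # capacity of all slab pages
  echo "partial:       $(cat partial)"         # slabs on node partial lists
  # An in-use/capacity ratio staying well below 1.0 suggests fragmentation.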

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

