From: Andrew Morton <akpm@linux-foundation.org>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Pekka Enberg <penberg@cs.helsinki.fi>
Subject: Re: [patch 0/6] Per cpu structures for SLUB
Date: Fri, 24 Aug 2007 14:38:48 -0700
Message-ID: <20070824143848.a1ecb6bc.akpm@linux-foundation.org>
In-Reply-To: <20070823064653.081843729@sgi.com>
On Wed, 22 Aug 2007 23:46:53 -0700
Christoph Lameter <clameter@sgi.com> wrote:
> The following patchset introduces per cpu structures for SLUB. These
> are very small (and multiples of these may fit into one cacheline)
> and (apart from performance improvements) allow the addressing of
> several issues in SLUB:
>
> 1. The number of objects per slab is no longer limited to a 16 bit
> number.
>
> 2. Room is freed up in the page struct. We can avoid using the
> mapping field, which allows us to get rid of the #ifdef CONFIG_SLUB
> in page_mapping().
>
> 3. We will have an easier time adding new things like Peter Z.'s reserve
> management.
>
> The RFC for this patchset was discussed on lkml a while ago:
>
> http://marc.info/?l=linux-kernel&m=118386677704534&w=2
>
> (And no this patchset does not include the use of cmpxchg_local that
> we discussed recently on lkml nor the cmpxchg implementation
> mentioned in the RFC)
>
> Performance
> -----------
>
>
> Norm = 2.6.23-rc3
> PCPU = Adds page allocator pass through plus per cpu structure patches
>
>
> IA64 8p 4n NUMA Altix
>
>              Single threaded             Concurrent Alloc
>
>              Kmalloc    Alloc/Free       Kmalloc    Alloc/Free
>    Size   Norm  PCPU    Norm  PCPU    Norm  PCPU    Norm  PCPU
> --------------------------------------------------------------
>       8    132    84      93   104      98    90      95   106
>      16     98    92      93   104     115    98      95   106
>      32    112   105      93   104     146   111      95   106
>      64    119   112      93   104     214   133      95   106
>     128    132   119      94   104     321   163      95   106
>    256+  83255   176     106   115     415   224     108   117
>     512    191   176     106   115     487   341     108   117
>    1024    252   246     106   115     937   609     108   117
>    2048    308   292     107   115    2494  1207     108   117
>    4096    341   319     107   115    2497  1217     108   117
>    8192    402   380     107   115    2367  1188     108   117
>  16384*    560   474     106   434    4464  1904     108   478
>
> X86_64 2p SMP (Dual Core Pentium 940)
>
>              Single threaded             Concurrent Alloc
>
>              Kmalloc    Alloc/Free       Kmalloc    Alloc/Free
>    Size   Norm  PCPU    Norm  PCPU    Norm  PCPU    Norm  PCPU
> --------------------------------------------------------------
>       8    313   227     314   324     207   208     314   323
>      16    202   203     315   324     209   211     312   321
>      32    212   207     314   324     251   243     312   321
>      64    240   237     314   326     329   306     312   321
>     128    301   302     314   324     511   416     313   324
>     256    498   554     327   332     970   837     326   332
>     512    532   553     324   332    1025   932     326   335
>    1024    705   718     325   333    1489  1231     324   330
>    2048    764   767     324   334    2708  2175     324   332
>   4096*   1033   476     325   674    4727   782     324   678

I'm struggling a bit to understand these numbers. Bigger is better, I
assume? In what units are these numbers?

> Notes:
>
> Worst case:
> -----------
> We generally lose in the alloc free test (x86_64 3%, IA64 5-10%)
> since the processing overhead increases because we need to look up
> the per cpu structure. Alloc/Free is simply kfree(kmalloc(size, mask)).
> So objects with the shortest lifetime possible. We would never use
> objects in that way but the measurement is important to show the worst
> case overhead created.
>
> Single Threaded:
> ----------------
> The single threaded kmalloc test shows behavior of a continual stream
> of allocation without contention. In the SMP case the losses are minimal.
> In the NUMA case we already have a winner there because the per cpu structure
> is placed local to the processor. So in the single threaded case we already
> win around 5% just by placing things better.
>
> Concurrent Alloc:
> -----------------
> We have varying gains of up to 50% on NUMA because we are now never updating
> a cacheline used by the other processor and the data structures are local
> to the processor.
>
> The SMP case shows gains but they are smaller (especially since
> this is the smallest SMP system possible.... 2 CPUs). So only up
> to 25%.
>
> Page allocator pass through
> ---------------------------
> There is a significant difference in the columns marked with a * because
> of the way that allocations for page sized objects are handled.
OK, but what happened to the third pair of columns (Concurrent Alloc,
Kmalloc) for 1024 and 2048-byte allocations? They seem to have become
significantly slower?
Thanks for running the numbers, but it's still a bit hard to work out
whether these changes are an aggregate benefit?
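
One way to put numbers on the rows in question is to compute the relative
change from Norm to PCPU. The sketch below is not part of the patchset; it
only reuses the Concurrent Alloc / Kmalloc figures from the quoted tables,
and the names pct_change and geomean_ratio are made up for illustration.
The units are not stated in the thread, so whether a negative change is a
gain or a loss depends on whether the figures are costs or rates.

```python
import math

# (Norm, PCPU) pairs copied from the quoted tables,
# Concurrent Alloc / Kmalloc column. Units are not stated in the thread.
rows_in_question = {
    ("IA64", 1024): (937, 609),
    ("IA64", 2048): (2494, 1207),
    ("x86_64", 1024): (1489, 1231),
    ("x86_64", 2048): (2708, 2175),
}

# Full IA64 Concurrent Alloc / Kmalloc column, sizes 8 through 16384.
ia64_concurrent_kmalloc = [
    (98, 90), (115, 98), (146, 111), (214, 133), (321, 163), (415, 224),
    (487, 341), (937, 609), (2494, 1207), (2497, 1217), (2367, 1188),
    (4464, 1904),
]


def pct_change(norm, pcpu):
    """Relative change from Norm to PCPU in percent (negative: PCPU is smaller)."""
    return 100.0 * (pcpu - norm) / norm


def geomean_ratio(pairs):
    """Geometric mean of PCPU/Norm over a list of (Norm, PCPU) pairs."""
    logs = [math.log(pcpu / norm) for norm, pcpu in pairs]
    return math.exp(sum(logs) / len(logs))


for (arch, size), (norm, pcpu) in rows_in_question.items():
    change = pct_change(norm, pcpu)
    print(f"{arch:<7} {size:>5d}  {norm:>5d} -> {pcpu:>5d}  {change:+.1f}%")

print(f"IA64 column, geometric mean of PCPU/Norm: "
      f"{geomean_ratio(ia64_concurrent_kmalloc):.2f}")
```

Using a geometric mean of the PCPU/Norm ratios to summarise a whole column
as one figure is an assumption of this sketch, not something the posted
benchmark reports.
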
> If we handle
> the allocations in the slab allocator (Norm) then the alloc free tests
> results are superb since we can use the per cpu slab to just pass a pointer
> back and forth. The page allocator pass through (PCPU) shows that the page
> allocator may have problems with giving back the same page after a free.
> Or there is something else in the page allocator that creates significant
> overhead compared to slab. Needs to be checked out I guess.
>
> However, the page allocator pass through is a win in the other cases
> since we can cut out the page allocator overhead. That is the more typical
> load of allocating a sequence of objects and we should optimize for that.
>
> (+ = Must be some cache artifact here or code crossing a TLB boundary.
> The result is reproducible)
>
Most Linux machines are uniprocessor. We should keep an eye on what effect
a change like this has on code size and performance for CONFIG_SMP=n
builds.