From: Mel Gorman <mel@csn.ul.ie>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Nick Piggin <npiggin@suse.de>,
Pekka J Enberg <penberg@cs.helsinki.fi>
Subject: Re: SLUB tbench regression due to page allocator deficiency
Date: Mon, 11 Feb 2008 13:50:47 +0000
Message-ID: <20080211135046.GD31903@csn.ul.ie>
In-Reply-To: <Pine.LNX.4.64.0802091332450.12965@schroedinger.engr.sgi.com>
On (09/02/08 13:45), Christoph Lameter didst pronounce:
> I have been chasing the tbench regression (1-4%) for two weeks now and
> even after I added statistics I could only verify that behavior was just
> optimal.
>
> None of the tricks that I threw at the problem changed anything until I
> realized that the tbench load depends heavily on 4k allocations that SLUB
> hands off to the page allocator (SLAB handles 4k itself). I extended the
> kmalloc array to 4k and I got:
This poked me into checking the results I got when comparing SLAB/SLUB
in 2.6.24. I ran a fairly large set of benchmarks and then failed to
follow up on it :/
The results I got for tbench were considerably worse than 4% (sorry for
the wide output):
bl6-13/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
bl6-13/report.txt- Min Average Max Std. Deviation
bl6-13/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
bl6-13/report.txt-clients-1 176.37/156.81 (-12.47%) 186.57/173.51 ( -7.53%) 204.71/209.94 ( 2.49%) 9.51/13.64 ( -43.38%)
bl6-13/report.txt-clients-2 319.70/282.60 (-13.13%) 347.16/313.87 (-10.61%) 414.66/343.35 (-20.77%) 21.12/12.45 ( 41.04%)
bl6-13/report.txt-clients-4 854.17/685.53 (-24.60%) 1024.46/845.32 (-21.19%) 1067.28/905.61 (-17.85%) 44.97/46.26 ( -2.87%)
bl6-13/report.txt-clients-8 974.06/835.80 (-16.54%) 1010.90/882.97 (-14.49%) 1027.36/917.22 (-12.01%) 13.68/19.84 ( -45.00%)
bl6-13/report.txt-
--
elm3a203/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
elm3a203/report.txt- Min Average Max Std. Deviation
elm3a203/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
elm3a203/report.txt-clients-1 111.25/97.59 (-13.99%) 112.30/99.66 (-12.68%) 113.25/101.21 (-11.89%) 0.49/0.78 ( -59.29%)
elm3a203/report.txt-clients-1 112.28/97.39 (-15.29%) 113.13/99.68 (-13.50%) 113.79/100.58 (-13.13%) 0.32/0.87 (-176.48%)
elm3a203/report.txt-clients-2 149.01/131.90 (-12.97%) 151.04/136.51 (-10.64%) 152.52/139.26 ( -9.53%) 0.97/1.51 ( -55.79%)
elm3a203/report.txt-clients-4 145.94/130.05 (-12.22%) 147.62/132.33 (-11.56%) 148.92/134.26 (-10.92%) 0.88/1.10 ( -25.10%)
elm3a203/report.txt-
--
elm3b133/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
elm3b133/report.txt- Min Average Max Std. Deviation
elm3b133/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b133/report.txt-clients-1 28.17/26.95 ( -4.53%) 28.34/27.25 ( -4.01%) 28.53/27.38 ( -4.22%) 0.09/0.10 ( -5.42%)
elm3b133/report.txt-clients-2 52.55/50.61 ( -3.83%) 53.20/51.28 ( -3.74%) 54.47/51.82 ( -5.11%) 0.49/0.33 ( 32.41%)
elm3b133/report.txt-clients-4 111.15/105.14 ( -5.71%) 113.29/107.29 ( -5.59%) 114.16/108.58 ( -5.13%) 0.69/0.91 ( -32.14%)
elm3b133/report.txt-clients-8 109.63/104.37 ( -5.04%) 110.14/104.78 ( -5.12%) 110.80/105.43 ( -5.10%) 0.25/0.27 ( -8.94%)
elm3b133/report.txt-
--
elm3b19/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
elm3b19/report.txt- Min Average Max Std. Deviation
elm3b19/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b19/report.txt-clients-1 118.85/0.00 ( 0.00%) 123.72/115.79 ( -6.84%) 131.11/129.94 ( -0.90%) 3.77/26.77 (-609.67%)
elm3b19/report.txt-clients-1 118.68/117.89 ( -0.67%) 124.65/123.52 ( -0.91%) 137.54/132.09 ( -4.13%) 5.52/4.20 ( 23.78%)
elm3b19/report.txt-clients-2 223.73/211.77 ( -5.64%) 339.06/334.21 ( -1.45%) 367.83/357.20 ( -2.97%) 38.36/30.30 ( 21.03%)
elm3b19/report.txt-clients-4 320.07/316.04 ( -1.28%) 331.93/324.42 ( -2.31%) 341.92/332.29 ( -2.90%) 5.51/4.07 ( 26.03%)
elm3b19/report.txt-
--
elm3b6/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
elm3b6/report.txt- Min Average Max Std. Deviation
elm3b6/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b6/report.txt-clients-1 148.01/140.79 ( -5.13%) 156.19/153.67 ( -1.64%) 182.30/185.11 ( 1.52%) 9.76/13.84 ( -41.84%)
elm3b6/report.txt-clients-2 251.07/253.07 ( 0.79%) 292.60/286.59 ( -2.10%) 338.81/360.93 ( 6.13%) 22.58/21.48 ( 4.85%)
elm3b6/report.txt-clients-4 673.43/523.51 (-28.64%) 784.56/761.89 ( -2.98%) 846.40/818.38 ( -3.42%) 36.95/82.30 (-122.75%)
elm3b6/report.txt-clients-8 652.73/700.72 ( 6.85%) 783.54/772.22 ( -1.47%) 833.56/812.21 ( -2.63%) 47.45/27.48 ( 42.09%)
elm3b6/report.txt-
--
gekko-lp1/report.txt:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-debug)
gekko-lp1/report.txt- Min Average Max Std. Deviation
gekko-lp1/report.txt- --------------------------- --------------------------- --------------------------- ----------------------------
gekko-lp1/report.txt-clients-1 170.56/163.15 ( -4.55%) 206.59/194.96 ( -5.96%) 221.27/206.46 ( -7.17%) 17.06/13.59 ( 20.34%)
gekko-lp1/report.txt-clients-2 302.55/277.45 ( -9.05%) 319.14/306.39 ( -4.16%) 328.74/313.01 ( -5.03%) 6.21/8.18 ( -31.81%)
gekko-lp1/report.txt-clients-4 467.98/393.05 (-19.06%) 490.42/464.23 ( -5.64%) 503.74/477.13 ( -5.58%) 10.49/17.68 ( -68.61%)
gekko-lp1/report.txt-clients-8 469.16/447.00 ( -4.96%) 492.14/468.37 ( -5.07%) 498.79/472.47 ( -5.57%) 7.08/5.61 ( 20.79%)
gekko-lp1/report.txt-
I think I didn't look too closely because kernbench was generally ok,
hackbench showed gains and losses depending on the machine, and TBench
has historically been a bit all over the place. That was a mistake
though, as there was a definite slowdown even with the variances taken
into account.
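
For anyone reading the tables above: each throughput cell is "slab/slub
(delta)". The reports do not state how the delta is derived, but the
figures are consistent with the difference being taken relative to the
SLUB value. A minimal sketch under that assumption follows (the function
name is made up for illustration, and the Std. Deviation column is not
covered):

```c
#include <assert.h>

/*
 * Percentage change of SLUB throughput versus SLAB throughput.
 * Assumption inferred from the report figures, not documented there:
 * the difference is divided by the SLUB value, not the SLAB value.
 * A negative result means SLUB was slower.
 */
static double throughput_delta_pct(double slab, double slub)
{
	assert(slub != 0.0);	/* a 0.00 reading has no meaningful delta */
	return (slub - slab) / slub * 100.0;
}
```

With the bl6-13 clients-1 minimum, throughput_delta_pct(176.37, 156.81)
comes out at roughly -12.47, matching the report.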
>
> christoph@stapp:~$ slabinfo -AD
> Name Objects Alloc Free %Fast
> :0004096 180 665259550 665259415 99 99
> skbuff_fclone_cache 46 665196592 665196592 99 99
> :0000192 2575 31232665 31230129 99 99
> :0001024 854 31204838 31204006 99 99
> vm_area_struct 1093 108941 107954 91 17
> dentry 7738 26248 18544 92 43
> :0000064 2179 19208 17287 97 73
>
> So the kmalloc-4096 is heavily used. If I give the 4k objects a reasonable
> allocation size in slub (PAGE_ALLOC_COSTLY_ORDER) then the fastpath of
> SLUB becomes effective for 4k allocs and then SLUB is faster than SLAB
> here.
>
> Performance on tbench (Dual Quad 8p 8G):
>
> SLAB 2223.32 MB/sec
> SLUB unmodified 2144.36 MB/sec
> SLUB+patch 2245.56 MB/sec (stats still active so this isnt optimal yet)
>
I'll run tests for this patch and see what it looks like.
> 4k allocations cannot optimally be handled by SLUB if we are restricted to
> order 0 allocs because the fastpath only handles fractions of one
> allocation unit and if the allocation unit is 4k then we only have one
> object per slab.
>
> Isnt there a way that we can make the page allocator handle PAGE_SIZEd
> allocations in such a way that is competitive with the slab allocators?
Probably. It's been on my TODO list for an age to see what can be done.
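
The "one object per slab" point quoted above is plain arithmetic. As a
rough sketch, assuming 4k pages and ignoring any debugging metadata
stored in the object (the names below are illustrative, not taken from
mm/slub.c):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096UL	/* assumption: 4k pages */
#define SKETCH_COSTLY_ORDER	3	/* PAGE_ALLOC_COSTLY_ORDER in mainline */

/*
 * Number of objects of object_size bytes that fit in a slab backed by
 * an order-N page allocation (2^N contiguous pages).
 */
static unsigned long objects_per_slab(unsigned int order, size_t object_size)
{
	assert(object_size > 0);
	return (SKETCH_PAGE_SIZE << order) / object_size;
}
```

objects_per_slab(0, 4096) is 1, so every 4k allocation from an order-0
slab goes back to the page allocator, while
objects_per_slab(SKETCH_COSTLY_ORDER, 4096) is 8, which is the "8 4k
objects per 32k slab allocation" mentioned in the patch comment.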
> The cycle count for an allocation needs to be <100 not just below 1000 as
> it is now.
>
> ---
> include/linux/slub_def.h | 6 +++---
> mm/slub.c | 25 +++++++++++++++++--------
> 2 files changed, 20 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h 2008-02-09 13:04:48.464203968 -0800
> +++ linux-2.6/include/linux/slub_def.h 2008-02-09 13:08:37.413120259 -0800
> @@ -110,7 +110,7 @@ struct kmem_cache {
> * We keep the general caches in an array of slab caches that are used for
> * 2^x bytes of allocations.
> */
> -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT];
> +extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
>
> /*
> * Sorry that the following has to be that ugly but some versions of GCC
> @@ -191,7 +191,7 @@ void *__kmalloc(size_t size, gfp_t flags
> static __always_inline void *kmalloc(size_t size, gfp_t flags)
> {
> if (__builtin_constant_p(size)) {
> - if (size > PAGE_SIZE / 2)
> + if (size > PAGE_SIZE)
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -214,7 +214,7 @@ void *kmem_cache_alloc_node(struct kmem_
> static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> {
> if (__builtin_constant_p(size) &&
> - size <= PAGE_SIZE / 2 && !(flags & SLUB_DMA)) {
> + size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
> struct kmem_cache *s = kmalloc_slab(size);
>
> if (!s)
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-02-09 13:04:48.472203975 -0800
> +++ linux-2.6/mm/slub.c 2008-02-09 13:14:43.786633258 -0800
> @@ -1919,6 +1919,15 @@ static inline int calculate_order(int si
> int fraction;
>
> /*
> + * Cover up bad performance of page allocator fastpath vs
> + * slab allocator fastpaths. Take the largest order reasonable
> + * in order to be able to avoid partial list overhead.
> + *
> + * This yields 8 4k objects per 32k slab allocation.
> + */
> + if (size == PAGE_SIZE)
> + return PAGE_ALLOC_COSTLY_ORDER;
> + /*
> * Attempt to find best configuration for a slab. This
> * works by first attempting to generate a layout with
> * the best configuration and backing off gradually.
> @@ -2484,11 +2493,11 @@ EXPORT_SYMBOL(kmem_cache_destroy);
> * Kmalloc subsystem
> *******************************************************************/
>
> -struct kmem_cache kmalloc_caches[PAGE_SHIFT] __cacheline_aligned;
> +struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
> EXPORT_SYMBOL(kmalloc_caches);
>
> #ifdef CONFIG_ZONE_DMA
> -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT];
> +static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
> #endif
>
> static int __init setup_slub_min_order(char *str)
> @@ -2670,7 +2679,7 @@ void *__kmalloc(size_t size, gfp_t flags
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -2688,7 +2697,7 @@ void *__kmalloc_node(size_t size, gfp_t
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -3001,7 +3010,7 @@ void __init kmem_cache_init(void)
> caches++;
> }
>
> - for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++) {
> + for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
> create_kmalloc_cache(&kmalloc_caches[i],
> "kmalloc", 1 << i, GFP_KERNEL);
> caches++;
> @@ -3028,7 +3037,7 @@ void __init kmem_cache_init(void)
> slab_state = UP;
>
> /* Provide the correct kmalloc names now that the caches are up */
> - for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++)
> + for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
> kmalloc_caches[i]. name =
> kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
>
> @@ -3218,7 +3227,7 @@ void *__kmalloc_track_caller(size_t size
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(gfpflags | __GFP_COMP,
> get_order(size));
> s = get_slab(size, gfpflags);
> @@ -3234,7 +3243,7 @@ void *__kmalloc_node_track_caller(size_t
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(gfpflags | __GFP_COMP,
> get_order(size));
> s = get_slab(size, gfpflags);
>
>
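
The PAGE_SHIFT + 1 array size and the "<=" loop bounds in the quoted
patch follow from how power-of-two sizes map to array slots: a 4k
object needs index 12, which does not exist in an array of PAGE_SHIFT
(12) entries. A simplified sketch, assuming PAGE_SHIFT is 12 and
ignoring the special non-power-of-two caches that the real lookup
handles (helper names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4k pages */

/* Smallest shift such that (1 << shift) >= size, i.e. the cache slot. */
static unsigned int sketch_kmalloc_index(size_t size)
{
	unsigned int shift = 0;

	assert(size > 0);
	while (((size_t)1 << shift) < size)
		shift++;
	return shift;
}

/* Non-zero if a request of this size has a slot in an array of array_len. */
static int sketch_has_slot(size_t size, unsigned int array_len)
{
	return sketch_kmalloc_index(size) < array_len;
}
```

sketch_has_slot(4096, SKETCH_PAGE_SHIFT) is 0 and
sketch_has_slot(4096, SKETCH_PAGE_SHIFT + 1) is 1, which is why the
array has to grow by one entry before the size checks can be relaxed
from PAGE_SIZE / 2 to PAGE_SIZE.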
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab