From: Mel Gorman <mel@csn.ul.ie>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Nick Piggin <npiggin@suse.de>,
Pekka J Enberg <penberg@cs.helsinki.fi>
Subject: Re: SLUB tbench regression due to page allocator deficiency
Date: Wed, 13 Feb 2008 11:15:17 +0000 [thread overview]
Message-ID: <20080213111516.GA4007@csn.ul.ie> (raw)
In-Reply-To: <Pine.LNX.4.64.0802091332450.12965@schroedinger.engr.sgi.com>
On (09/02/08 13:45), Christoph Lameter didst pronounce:
> I have been chasing the tbench regression (1-4%) for two weeks now and
> even after I added statistics I could only verify that behavior was just
> optimal.
>
> None of the tricks that I threw at the problem changed anything until I
> realized that the tbench load depends heavily on 4k allocations that SLUB
> hands off to the page allocator (SLAB handles 4k itself). I extended the
> kmalloc array to 4k and I got:
>
> christoph@stapp:~$ slabinfo -AD
> Name Objects Alloc Free %Fast
> :0004096 180 665259550 665259415 99 99
> skbuff_fclone_cache 46 665196592 665196592 99 99
> :0000192 2575 31232665 31230129 99 99
> :0001024 854 31204838 31204006 99 99
> vm_area_struct 1093 108941 107954 91 17
> dentry 7738 26248 18544 92 43
> :0000064 2179 19208 17287 97 73
>
> So the kmalloc-4096 is heavily used. If I give the 4k objects a reasonable
> allocation size in slub (PAGE_ALLOC_COSTLY_ORDER) then the fastpath of
> SLUB becomes effective for 4k allocs and then SLUB is faster than SLAB
> here.
>
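(As an aside, to make the objects-per-slab arithmetic concrete: the following
is only an illustrative sketch of the calculation, not the actual slub code.
It assumes 4k pages and ignores slab metadata overhead.)

#include <stdio.h>

#define PAGE_SIZE			4096UL
#define PAGE_ALLOC_COSTLY_ORDER		3	/* as used by the patch below */

/* How many kmalloc-4096 objects fit in one slab of the given order. */
static unsigned long objects_per_slab(unsigned long obj_size, int order)
{
	return (PAGE_SIZE << order) / obj_size;
}

int main(void)
{
	/*
	 * Order-0 slab: one 4k object per slab, so every allocation needs
	 * a fresh page and the slub fastpath never gets a look in.
	 */
	printf("order 0: %lu object(s) per slab\n",
		objects_per_slab(4096, 0));

	/*
	 * Order-3 slab (32k): 8 objects per slab, so 7 out of every 8
	 * allocations can be satisfied from the cpu slab fastpath.
	 */
	printf("order %d: %lu object(s) per slab\n",
		PAGE_ALLOC_COSTLY_ORDER,
		objects_per_slab(4096, PAGE_ALLOC_COSTLY_ORDER));

	return 0;
}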
> Performance on tbench (Dual Quad 8p 8G):
>
> SLAB 2223.32 MB/sec
> SLUB unmodified 2144.36 MB/sec
> SLUB+patch 2245.56 MB/sec (stats still active so this isnt optimal yet)
>
I ran similar tests for tbench and also for sysbench, as it is closer to a
real workload. I have results from other tests as well, although the oddest
was hackbench, which showed +/- 30% performance gains or losses depending on
the machine.
In the results below, the tbench comparisons are between slab and slub-lameter,
which is the first patch posted in this thread; I already posted the figures
for slab vs slub-vanilla. The sysbench results compare 2.6.23, 2.6.24-slab,
2.6.24-slub-vanilla and 2.6.24-slub-lameter.
Short answer: slub-lameter appears to be a win over slab in most cases.
However, such different behaviour between machines on even small tests is
something to be wary of. On one machine the patch makes SLUB slower, but
that was not typical.
Note that sysbench was not run everywhere, as some crinkles in the automation
are still being ironed out. Incidentally, its scalability sucks. Beyond 8
threads, performance starts dropping sharply, but I haven't tried a different
userspace allocator yet even though the system malloc was identified as a
problem in the past (http://ozlabs.org/~anton/linux/sysbench/).
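For reference, the tables below report min, average, max and standard
deviation over several tbench runs for each client count, shown as
slab/slub-lameter with the relative difference in brackets (positive meaning
slub-lameter ahead for throughput, and less variance for the deviation
column). The reporting script itself isn't reproduced here; the sketch below
only illustrates the arithmetic with made-up sample numbers, and the exact
denominator and rounding the real script uses may differ slightly.

#include <stdio.h>
#include <math.h>

/*
 * Illustration only: how a "slab/slub ( x.xx%)" cell could be derived
 * from repeated runs. Positive means slub-lameter is ahead.
 */
static double pct_diff(double slab, double slub)
{
	return (slub - slab) / slab * 100.0;
}

static void stats(const double *v, int n,
		  double *min, double *avg, double *max, double *dev)
{
	double sum = 0.0, sumsq = 0.0;
	int i;

	*min = *max = v[0];
	for (i = 0; i < n; i++) {
		if (v[i] < *min)
			*min = v[i];
		if (v[i] > *max)
			*max = v[i];
		sum += v[i];
	}
	*avg = sum / n;
	for (i = 0; i < n; i++)
		sumsq += (v[i] - *avg) * (v[i] - *avg);
	*dev = sqrt(sumsq / n);
}

int main(void)
{
	/* Hypothetical throughput samples (MB/sec) from repeated runs. */
	double slab[] = { 84.22, 84.51, 84.05, 84.13 };
	double slub[] = { 83.31, 83.54, 83.40, 83.47 };
	double mn[2], av[2], mx[2], dv[2];

	stats(slab, 4, &mn[0], &av[0], &mx[0], &dv[0]);
	stats(slub, 4, &mn[1], &av[1], &mx[1], &dv[1]);

	printf("clients-N %.2f/%.2f (%6.2f%%) %.2f/%.2f (%6.2f%%)\n",
		mn[0], mn[1], pct_diff(mn[0], mn[1]),
		av[0], av[1], pct_diff(av[0], av[1]));
	return 0;
}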
elm3a238:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3a238- Min Average Max Std. Deviation
elm3a238- --------------------------- --------------------------- --------------------------- ----------------------------
elm3a238-clients-1 80.00/82.66 ( 3.22%) 84.22/83.43 ( -0.95%) 84.77/83.84 ( -1.11%) 1.00/0.25 ( 75.20%)
elm3a238-clients-1 84.00/83.01 ( -1.20%) 84.41/83.42 ( -1.18%) 84.87/83.73 ( -1.36%) 0.20/0.22 ( -11.47%)
elm3a238-clients-2 115.71/114.89 ( -0.71%) 117.03/115.25 ( -1.55%) 117.71/115.79 ( -1.65%) 0.41/0.25 ( 39.51%)
elm3a238-clients-4 116.37/113.40 ( -2.63%) 116.81/113.78 ( -2.66%) 117.24/114.24 ( -2.62%) 0.24/0.21 ( 13.45%)
elm3a238-
sysbench: http://www.csn.ul.ie/~mel/postings/tsysbench-20080213/elm3a238-comparison.ps
tbench is still showing a small regression against SLAB here. However,
sysbench tells a different story: 2.6.23 was fastest, but 2.6.24-slub-vanilla
was faster than 2.6.24-slab. This patch actually made sysbench slower on
this machine.
==
elm3a69:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3a69- Min Average Max Std. Deviation
elm3a69- --------------------------- --------------------------- --------------------------- ----------------------------
elm3a69-clients-1 174.21/173.15 ( -0.62%) 174.83/174.13 ( -0.41%) 175.52/175.22 ( -0.17%) 0.39/0.50 ( -29.83%)
elm3a69-clients-1 173.73/173.94 ( 0.12%) 175.10/174.35 ( -0.43%) 175.97/174.92 ( -0.60%) 0.52/0.22 ( 57.76%)
elm3a69-clients-2 261.58/256.71 ( -1.90%) 299.03/301.13 ( 0.70%) 319.82/318.36 ( -0.46%) 23.85/23.56 ( 1.22%)
elm3a69-clients-4 312.55/308.88 ( -1.19%) 316.44/313.57 ( -0.92%) 319.31/316.10 ( -1.02%) 1.61/1.76 ( -9.28%)
elm3a69-
sysbench: unavailable
The patch makes SLUB and SLAB comparable on this machine. For example, with
2 clients it was previously a 4.24% regression and here it shows a 0.70%
gain. However, the difference between kernels is within the standard
deviation of multiple runs, so the only safe conclusion is that with the
patch the two allocators are comparable.
==
elm3b133:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3b133- Min Average Max Std. Deviation
elm3b133- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b133-clients-1 28.00/28.25 ( 0.89%) 28.23/28.46 ( 0.81%) 28.42/28.63 ( 0.73%) 0.09/0.11 ( -22.52%)
elm3b133-clients-2 52.59/52.70 ( 0.20%) 53.33/53.86 ( 0.98%) 53.93/54.97 ( 1.89%) 0.45/0.52 ( -16.18%)
elm3b133-clients-4 111.24/110.89 ( -0.31%) 112.75/114.51 ( 1.53%) 114.07/115.77 ( 1.46%) 0.78/1.38 ( -76.86%)
elm3b133-clients-8 110.03/110.13 ( 0.09%) 110.57/110.99 ( 0.38%) 111.06/111.55 ( 0.44%) 0.25/0.33 ( -34.68%)
sysbench: unavailable
The patch is clearly a win on this machine. Without it, regressions were
between 3.7% and 5.6%; with it applied, it's mainly gains.
==
elm3b19:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3b19- Min Average Max Std. Deviation
elm3b19- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b19-clients-1 115.39/119.30 ( 3.27%) 118.53/124.61 ( 4.88%) 126.01/158.10 ( 20.29%) 2.79/8.61 (-208.87%)
elm3b19-clients-1 116.56/117.51 ( 0.81%) 120.83/123.23 ( 1.95%) 131.91/130.33 ( -1.21%) 3.68/3.24 ( 11.85%)
elm3b19-clients-2 255.60/350.33 ( 27.04%) 345.43/365.53 ( 5.50%) 366.62/375.17 ( 2.28%) 27.64/7.05 ( 74.49%)
elm3b19-clients-4 323.56/324.04 ( 0.15%) 334.94/334.64 ( -0.09%) 344.60/339.71 ( -1.44%) 4.98/3.73 ( 25.05%)
sysbench: unavailable
The patch is even more clearly a win on this machine for tbench. It went from
losses of between 0.9% and 6.8% to decent gains in some cases.
==
elm3b6:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3b6- Min Average Max Std. Deviation
elm3b6- --------------------------- --------------------------- --------------------------- ----------------------------
elm3b6-clients-1 138.88/143.41 ( 3.16%) 152.46/154.96 ( 1.62%) 176.49/179.71 ( 1.79%) 10.05/10.84 ( -7.88%)
elm3b6-clients-2 264.83/263.46 ( -0.52%) 289.35/290.40 ( 0.36%) 316.65/337.88 ( 6.28%) 12.49/19.85 ( -58.99%)
elm3b6-clients-4 704.21/642.25 ( -9.65%) 751.06/764.37 ( 1.74%) 778.11/812.44 ( 4.23%) 20.93/45.31 (-116.53%)
elm3b6-clients-8 635.54/650.29 ( 2.27%) 732.18/745.58 ( 1.80%) 799.16/794.92 ( -0.53%) 49.19/42.10 ( 14.43%)
sysbench: http://www.csn.ul.ie/~mel/postings/tsysbench-20080213/elm3b6-comparison.ps
Again, solid gains: it went from losses of around 2% to gains of around 1%.
sysbench is less clear-cut, but slub-lameter inches slightly ahead of slab
more often than not.
==
gekko-lp1:TBench Throughput Comparisons (2.6.24.2-slab/2.6.24.2-slub-lameter)
gekko-lp1- Min Average Max Std. Deviation
gekko-lp1- --------------------------- --------------------------- --------------------------- ----------------------------
gekko-lp1-clients-1 169.17/176.54 ( 4.18%) 198.45/213.08 ( 6.87%) 219.36/227.24 ( 3.47%) 16.80/14.73 ( 12.31%)
gekko-lp1-clients-2 308.51/319.39 ( 3.41%) 323.06/329.19 ( 1.86%) 333.15/337.09 ( 1.17%) 7.04/4.43 ( 37.10%)
gekko-lp1-clients-4 465.10/390.12 (-19.22%) 494.48/493.46 ( -0.21%) 508.70/516.23 ( 1.46%) 11.28/33.51 (-196.97%)
gekko-lp1-clients-8 476.20/435.68 ( -9.30%) 494.68/505.39 ( 2.12%) 504.11/513.86 ( 1.90%) 8.89/16.92 ( -90.46%)
sysbench: unavailable
Once again, losses of up to 6% without the patch turn into gains of up to
6.8% with it. Win.
==
gekko-lp4:TBench Throughput Comparisons (2.6.24.2-slab/2.6.24.2-slub-lameter)
gekko-lp4- Min Average Max Std. Deviation
gekko-lp4- --------------------------- --------------------------- --------------------------- ----------------------------
gekko-lp4-clients-1 167.17/190.43 ( 12.21%) 167.96/190.72 ( 11.93%) 169.27/191.04 ( 11.40%) 0.42/0.18 ( 58.78%)
gekko-lp4-clients-1 166.89/190.70 ( 12.48%) 167.88/191.25 ( 12.22%) 169.13/192.56 ( 12.17%) 0.47/0.52 ( -9.98%)
gekko-lp4-clients-2 250.79/300.33 ( 16.50%) 257.55/305.09 ( 15.58%) 260.71/309.14 ( 15.67%) 2.50/2.61 ( -4.49%)
gekko-lp4-clients-4 258.46/297.84 ( 13.22%) 259.18/303.32 ( 14.55%) 259.76/307.08 ( 15.41%) 0.44/2.62 (-494.48%)
sysbench: http://www.csn.ul.ie/~mel/postings/tsysbench-20080213/gekko-lp4-comparison.ps
Big gains in tbench here with the patch. Annoyingly, I find as I write this
that I don't have tbench figures comparing slab with vanilla slub, but I have
no reason to believe there is anything anomalous there. sysbench shows that
SLUB wins big over 2.6.24-slab on this machine, although oddly it is only
comparable with 2.6.23-slab. The patch does not make any significant
difference between the two slub configurations. So on this machine the patch
doesn't hurt.
==
bl6-13:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
bl6-13- Min Average Max Std. Deviation
bl6-13- --------------------------- --------------------------- --------------------------- ----------------------------
bl6-13-clients-1 178.31/164.15 ( -8.63%) 204.12/186.17 ( -9.64%) 265.64/230.74 (-15.12%) 24.44/19.71 ( 19.35%)
bl6-13-clients-2 305.58/304.75 ( -0.27%) 346.98/341.09 ( -1.72%) 406.30/405.27 ( -0.25%) 19.77/25.33 ( -28.15%)
bl6-13-clients-4 868.05/839.52 ( -3.40%) 990.49/893.36 (-10.87%) 1054.08/970.95 ( -8.56%) 47.59/37.87 ( 20.42%)
bl6-13-clients-8 927.69/770.35 (-20.42%) 1003.17/894.60 (-12.14%) 1030.28/930.52 (-10.72%) 23.46/33.32 ( -42.05%)
sysbench: http://www.csn.ul.ie/~mel/postings/tsysbench-20080213/bl6-13-comparison.ps
Even with the patch, tbench blows on this machine. Without the patch, though,
the regression was between 7% and 21%, so it's still an improvement. It's
worth noting that this machine routinely shows big differences between kernel
versions on all small benchmarks, so it's hard to draw a conclusion from
tbench here. sysbench shows significant gains over 2.6.23-slab in all cases.
2.6.24-slab is marginally better than slub, and the patch makes no big
difference to sysbench on this machine. Like gekko-lp4, the patch doesn't
hurt.
==
elm3a203:TBench Throughput Comparisons (2.6.24-slab/2.6.24-slub-lameter)
elm3a203- Min Average Max Std. Deviation
elm3a203- --------------------------- --------------------------- --------------------------- ----------------------------
elm3a203-clients-1 111.14/108.56 ( -2.37%) 112.46/109.65 ( -2.56%) 113.08/110.40 ( -2.43%) 0.50/0.38 ( 23.06%)
elm3a203-clients-1 111.57/108.41 ( -2.91%) 112.64/109.63 ( -2.75%) 113.48/110.56 ( -2.64%) 0.49/0.50 ( -1.53%)
elm3a203-clients-2 148.50/144.59 ( -2.70%) 151.82/147.48 ( -2.94%) 153.75/149.10 ( -3.12%) 1.27/1.31 ( -2.91%)
elm3a203-clients-4 146.39/140.32 ( -4.32%) 148.80/142.32 ( -4.55%) 150.70/143.56 ( -4.97%) 0.87/0.87 ( -0.99%)
elm3a203-
sysbench: unavailable
SLUB is a loss on this machine but, similar to bl6-13, it went from
regressions of 10-13% to regressions of 2-4%, so it is still an improvement.
==
So at the end of all that, it is very clear that modifications to this path
are not as clear-cut a win/loss as one might like. Despite the lack of
clarity, the patch appears to be a plus on balance in many cases, so
Acked-by: Mel Gorman <mel@csn.ul.ie>
> 4k allocations cannot optimally be handled by SLUB if we are restricted to
> order 0 allocs because the fastpath only handles fractions of one
> allocation unit and if the allocation unit is 4k then we only have one
> object per slab.
>
> Isnt there a way that we can make the page allocator handle PAGE_SIZEd
> allocations in such a way that is competitive with the slab allocators?
> The cycle count for an allocation needs to be <100 not just below 1000 as
> it is now.
>
> ---
> include/linux/slub_def.h | 6 +++---
> mm/slub.c | 25 +++++++++++++++++--------
> 2 files changed, 20 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h 2008-02-09 13:04:48.464203968 -0800
> +++ linux-2.6/include/linux/slub_def.h 2008-02-09 13:08:37.413120259 -0800
> @@ -110,7 +110,7 @@ struct kmem_cache {
> * We keep the general caches in an array of slab caches that are used for
> * 2^x bytes of allocations.
> */
> -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT];
> +extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
>
> /*
> * Sorry that the following has to be that ugly but some versions of GCC
> @@ -191,7 +191,7 @@ void *__kmalloc(size_t size, gfp_t flags
> static __always_inline void *kmalloc(size_t size, gfp_t flags)
> {
> if (__builtin_constant_p(size)) {
> - if (size > PAGE_SIZE / 2)
> + if (size > PAGE_SIZE)
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -214,7 +214,7 @@ void *kmem_cache_alloc_node(struct kmem_
> static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> {
> if (__builtin_constant_p(size) &&
> - size <= PAGE_SIZE / 2 && !(flags & SLUB_DMA)) {
> + size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
> struct kmem_cache *s = kmalloc_slab(size);
>
> if (!s)
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-02-09 13:04:48.472203975 -0800
> +++ linux-2.6/mm/slub.c 2008-02-09 13:14:43.786633258 -0800
> @@ -1919,6 +1919,15 @@ static inline int calculate_order(int si
> int fraction;
>
> /*
> + * Cover up bad performance of page allocator fastpath vs
> + * slab allocator fastpaths. Take the largest order reasonable
> + * in order to be able to avoid partial list overhead.
> + *
> + * This yields 8 4k objects per 32k slab allocation.
> + */
> + if (size == PAGE_SIZE)
> + return PAGE_ALLOC_COSTLY_ORDER;
> + /*
> * Attempt to find best configuration for a slab. This
> * works by first attempting to generate a layout with
> * the best configuration and backing off gradually.
> @@ -2484,11 +2493,11 @@ EXPORT_SYMBOL(kmem_cache_destroy);
> * Kmalloc subsystem
> *******************************************************************/
>
> -struct kmem_cache kmalloc_caches[PAGE_SHIFT] __cacheline_aligned;
> +struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
> EXPORT_SYMBOL(kmalloc_caches);
>
> #ifdef CONFIG_ZONE_DMA
> -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT];
> +static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
> #endif
>
> static int __init setup_slub_min_order(char *str)
> @@ -2670,7 +2679,7 @@ void *__kmalloc(size_t size, gfp_t flags
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -2688,7 +2697,7 @@ void *__kmalloc_node(size_t size, gfp_t
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(flags | __GFP_COMP,
> get_order(size));
>
> @@ -3001,7 +3010,7 @@ void __init kmem_cache_init(void)
> caches++;
> }
>
> - for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++) {
> + for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
> create_kmalloc_cache(&kmalloc_caches[i],
> "kmalloc", 1 << i, GFP_KERNEL);
> caches++;
> @@ -3028,7 +3037,7 @@ void __init kmem_cache_init(void)
> slab_state = UP;
>
> /* Provide the correct kmalloc names now that the caches are up */
> - for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++)
> + for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
> kmalloc_caches[i]. name =
> kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
>
> @@ -3218,7 +3227,7 @@ void *__kmalloc_track_caller(size_t size
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(gfpflags | __GFP_COMP,
> get_order(size));
> s = get_slab(size, gfpflags);
> @@ -3234,7 +3243,7 @@ void *__kmalloc_node_track_caller(size_t
> {
> struct kmem_cache *s;
>
> - if (unlikely(size > PAGE_SIZE / 2))
> + if (unlikely(size > PAGE_SIZE))
> return (void *)__get_free_pages(gfpflags | __GFP_COMP,
> get_order(size));
> s = get_slab(size, gfpflags);
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab