* Re: Mainline kernel OLTP performance update
  [not found] ` <20090115201210.ca1a9542.akpm@linux-foundation.org>
@ 2009-01-16  6:46 ` Nick Piggin
  2009-01-16  6:55   ` Matthew Wilcox
  ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread

From: Nick Piggin @ 2009-01-16  6:46 UTC (permalink / raw)
To: Andrew Morton, netdev, sfr
Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi,
    arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi,
    andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > I would like to see SLQB merged in mainline, made default, and wait for
> > some number of releases. Then we take what we know, and try to make an
> > informed decision about the best one to take. I guess that is problematic
> > in that the rest of the kernel is moving underneath us. Do you have
> > another idea?
>
> Nope.  If it doesn't work out, we can remove it again I guess.

OK, I have these numbers to show I'm not completely off my rocker to suggest
we merge SLQB :)

Given these results, how about I ask to merge SLQB as default in linux-next,
and if nothing catastrophic happens, merge it upstream in the next merge
window; then, a couple of releases after that, given some time to test and
tweak SLQB, we bite the bullet and emerge with just one main slab allocator
(plus SLOB).

The system is a 2-socket, 4-core AMD. All debug and stats options are turned
off for all the allocators, with default parameters (i.e. SLUB using
higher-order pages, the others tending to use order-0). SLQB is the version I
recently posted, with some of the prefetching removed according to Pekka's
review (probably a good idea to only add things like that if/when they prove
to be an improvement).

time fio examples/netio (10 runs, lower is better):
SLAB  AVG=13.19  STD=0.40
SLQB  AVG=13.78  STD=0.24
SLUB  AVG=14.47  STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be
very regular and easy (fast allocs and frees), so it could be some "lucky"
caching behaviour; I'm not exactly sure. I'll have to run more tests and
profiles here.

hackbench (10 runs, lower is better):
1 GROUP
SLAB  AVG=1.34  STD=0.05
SLQB  AVG=1.31  STD=0.06
SLUB  AVG=1.46  STD=0.07

2 GROUPS
SLAB  AVG=1.20  STD=0.09
SLQB  AVG=1.22  STD=0.12
SLUB  AVG=1.21  STD=0.06

4 GROUPS
SLAB  AVG=0.84  STD=0.05
SLQB  AVG=0.81  STD=0.10
SLUB  AVG=0.98  STD=0.07

8 GROUPS
SLAB  AVG=0.79  STD=0.10
SLQB  AVG=0.76  STD=0.15
SLUB  AVG=0.89  STD=0.08

16 GROUPS
SLAB  AVG=0.78  STD=0.08
SLQB  AVG=0.79  STD=0.10
SLUB  AVG=0.86  STD=0.05

32 GROUPS
SLAB  AVG=0.86  STD=0.05
SLQB  AVG=0.78  STD=0.06
SLUB  AVG=0.88  STD=0.06

64 GROUPS
SLAB  AVG=1.03  STD=0.05
SLQB  AVG=0.90  STD=0.04
SLUB  AVG=1.05  STD=0.06

128 GROUPS
SLAB  AVG=1.31  STD=0.19
SLQB  AVG=1.16  STD=0.36
SLUB  AVG=1.29  STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of groups,
but drops behind a bit more as they increase.
tbench (10 runs, higher is better):
1 THREAD
SLAB  AVG=239.25   STD=31.74
SLQB  AVG=257.75   STD=33.89
SLUB  AVG=223.02   STD=14.73

2 THREADS
SLAB  AVG=649.56   STD=9.77
SLQB  AVG=647.77   STD=7.48
SLUB  AVG=634.50   STD=7.66

4 THREADS
SLAB  AVG=1294.52  STD=13.19
SLQB  AVG=1266.58  STD=35.71
SLUB  AVG=1228.31  STD=48.08

8 THREADS
SLAB  AVG=2750.78  STD=26.67
SLQB  AVG=2758.90  STD=18.86
SLUB  AVG=2685.59  STD=22.41

16 THREADS
SLAB  AVG=2669.11  STD=58.34
SLQB  AVG=2671.69  STD=31.84
SLUB  AVG=2571.05  STD=45.39

SLAB and SLQB seem to be pretty close, winning some and losing some. They're
always within a standard deviation of one another, so we can't draw
conclusions between them. SLUB seems to be a bit slower.

Netperf UDP unidirectional send test (10 runs, higher is better):

Server and client bound to same CPU
SLAB  AVG=60.111  STD=1.59382
SLQB  AVG=60.167  STD=0.685347
SLUB  AVG=58.277  STD=0.788328

Server and client bound to same socket, different CPUs
SLAB  AVG=85.938  STD=0.875794
SLQB  AVG=93.662  STD=2.07434
SLUB  AVG=81.983  STD=0.864362

Server and client bound to different sockets
SLAB  AVG=78.801  STD=1.44118
SLQB  AVG=78.269  STD=1.10457
SLUB  AVG=71.334  STD=1.16809

SLQB is up with SLAB for the first and last cases, and faster in the second
case. SLUB trails in each case. (Any ideas for better types of netperf
tests?)

Kbuild numbers don't seem to be significantly different. SLAB and SLQB
actually got exactly the same average over 10 runs. The user+sys times tend
to be almost identical between allocators, with elapsed time depending mainly
on how much time the CPU was not idle.

Intel's OLTP shows SLQB is "neutral" to SLAB; that is, literally within their
measurement confidence interval. If it comes down to it, I think we could get
them to do more runs to narrow that down, but we're talking a couple of
tenths of a percent already.

I haven't done any non-local network tests. Networking is one of the
subsystems most heavily dependent on slab performance, so if anybody cares to
run their favourite tests, that would be really helpful.

Disclaimer
----------
Now remember this is just one specific HW configuration, and some allocators
for some reason give significantly (and sometimes perplexingly) different
results between different CPU and system architectures.

The other frustrating thing is that sometimes you happen to get a lucky or
unlucky cache or NUMA layout depending on the compile, the boot, etc., so
results sometimes get a little "skewed" in a way that isn't reflected in the
STDDEV. But I've tried to minimise that by dropping caches and restarting
services etc. between individual runs.

^ permalink raw reply	[flat|nested] 42+ messages in thread
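(For anyone wanting to reproduce the netperf numbers above: the exact command
lines aren't given in the thread, so the following is only a rough sketch of
a 4k UDP unidirectional send run; the host, CPU numbers and test length are
placeholders.)

	# start the receiver
	netserver
	# 4k UDP unidirectional send, 60 s, client and server pinned to CPUs 0 and 1
	netperf -t UDP_STREAM -H 127.0.0.1 -T 0,1 -l 60 -- -m 4096

The -T option pins netperf and netserver to the given CPUs, which is one way
to arrange the "same CPU" / "same socket" / "different sockets" bindings.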
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin @ 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 42+ messages in thread From: Matthew Wilcox @ 2009-01-16 6:55 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > their measurement confidence interval. If it comes down to it, I think we > could get them to do more runs to narrow that down, but we're talking a > couple of tenths of a percent already. I think I can speak with some measure of confidence for at least the OLTP-testing part of my company when I say that I have no objection to Nick's planned merge scheme. I believe the kernel benchmark group have also done some testing with SLQB and have generally positive things to say about it (Yanmin added to the gargantuan cc). Did slabtop get fixed to work with SLQB? -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 1 sibling, 0 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-16 7:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Friday 16 January 2009 17:55:47 Matthew Wilcox wrote: > On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > > their measurement confidence interval. If it comes down to it, I think we > > could get them to do more runs to narrow that down, but we're talking a > > couple of tenths of a percent already. > > I think I can speak with some measure of confidence for at least the > OLTP-testing part of my company when I say that I have no objection to > Nick's planned merge scheme. > > I believe the kernel benchmark group have also done some testing with > SLQB and have generally positive things to say about it (Yanmin added to > the gargantuan cc). > > Did slabtop get fixed to work with SLQB? Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB implements /proc/slabinfo). Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo tool to SLQB. Basically it reports in-depth internal event counts etc. and can operate on individual caches, making it very useful for performance "observability" and tuning. It is hard to come up with a single set of statistics that apply usefully to all the allocators. FWIW, it would be a useful tool to port over to SLAB too, if we end up deciding to go with SLAB. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-16  6:55 ` Matthew Wilcox
  2009-01-16  7:06   ` Nick Piggin
@ 2009-01-16  7:53   ` Zhang, Yanmin
  2009-01-16 10:20     ` Andi Kleen
  1 sibling, 1 reply; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-16  7:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
    linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
    harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel,
    chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-15 at 23:55 -0700, Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
>
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
>
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).

We did run lots of benchmarks with SLQB. Compared with SLUB, one highlight of
SLQB is netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1
server process and bind them to different physical CPUs, SLQB's result is
about 20% better than SLUB's. If I start CPU_NUM clients and the same number
of servers without binding, SLQB's result is about 100% better than SLUB's.
I think that's because SLQB doesn't pass big object allocations through to
the page allocator. netperf UDP-U-1k shows less improvement with SLQB.

The results of the other benchmarks vary: good on some machines, bad on
others. However, the variation is small. For example, hackbench's result
with SLQB was about 1 second worse than with SLUB on the 8-core Stoakley.
After we worked with Nick on a small code change, SLQB's hackbench result is
a little better than SLUB's on Stoakley. We consider the other variations to
be fluctuation.

All the testing uses the default SLUB and SLQB configuration.

>
> Did slabtop get fixed to work with SLQB?
>

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:53 ` Zhang, Yanmin @ 2009-01-16 10:20 ` Andi Kleen 2009-01-20 5:16 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Andi Kleen @ 2009-01-16 10:20 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > I think that's because SLQB > doesn't pass through big object allocation to page allocator. > netperf UDP-U-1k has less improvement with SLQB. That sounds like just the page allocator needs to be improved. That would help everyone. We talked a bit about this earlier, some of the heuristics for hot/cold pages are quite outdated and have been tuned for obsolete machines and also its fast path is quite long. Unfortunately no code currently. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-16 10:20 ` Andi Kleen
@ 2009-01-20  5:16   ` Zhang, Yanmin
  2009-01-21 23:58     ` Christoph Lameter
  0 siblings, 1 reply; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-20  5:16 UTC (permalink / raw)
To: Andi Kleen, Christoph Lameter, Pekka Enberg
Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr,
    matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang,
    hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez,
    anirban.chakraborty

On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> >
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
>
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.

Andi,

Thanks for your kind information. I did more investigation of the netperf
UDP-U-4k issue with SLUB.

oprofile shows:

328058   30.1342  linux-2.6.29-rc2  copy_user_generic_string
134666   12.3699  linux-2.6.29-rc2  __free_pages_ok
125447   11.5231  linux-2.6.29-rc2  get_page_from_freelist
22611     2.0770  linux-2.6.29-rc2  __sk_mem_reclaim
21442     1.9696  linux-2.6.29-rc2  list_del
21187     1.9462  linux-2.6.29-rc2  __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much CPU time.
With SLQB, these 2 functions consume almost no time.

Command 'slabinfo -AD' shows:

Name          Objects     Alloc      Free    %Fast
:0000256         1685  29611065  29609548    99 99
:0000168         2987    164689    161859    94 39
:0004096         1471    114918    113490    99 97

So kmem_cache :0000256 is very active.

A kernel stack dump in __free_pages_ok shows:

 [<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
 [<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8060f387>] __kfree_skb+0x9/0x6f
 [<ffffffff8061204b>] skb_free_datagram+0xc/0x31
 [<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
 [<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
 [<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb => kfree_skbmem => kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the
kmem_cache with :0000256. Their order is 1, which means every slab consists
of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending
4k-size packets to the server process, and the server process just receives
the packets one by one. If we start CPU_NUM clients and the same number of
servers, every client sends lots of packets within one sched slice, then the
scheduler schedules the server to receive many packets within one sched
slice, and then the client sends again. So there are many packets in the
queue. When the server receives the packets, it frees skbuff_head_cache
objects. When all of a slab's objects are free, the slab is released by
calling __free_pages. Such batch sending/receiving creates lots of slab free
activity.

The page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a buffer
of order-0 pages, but skbuff_head_cache's order here is 1, so UDP-U-4k can't
benefit from that page buffer.

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Free objects are put on that list first and
   can be picked up again quickly without taking a lock.
   A batch parameter controls when free objects are recollected; it is
   typically 1024.
2) SLQB's slab order is mostly 0, so although it sometimes calls
   alloc_pages/free_pages, it can benefit from the zone_pcp(zone, cpu)->pcp
   page buffer.

So SLUB needs to resolve the case where one process allocates a batch of
objects and another process frees them in a batch.

yanmin

^ permalink raw reply	[flat|nested] 42+ messages in thread
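(A quick way to double-check the two facts this argument rests on - the slab
order of the 256-byte cache and the order-0-only per-CPU page lists - is
sketched below. The sysfs name is the alias reported by slabinfo above, so
treat the exact paths as an assumption and adjust to what /sys/kernel/slab
actually contains on the test machine.)

	# order of the merged 256-byte cache that skbuff_head_cache falls into
	cat /sys/kernel/slab/:0000256/order
	# per-CPU page buffer (count/high/batch); it only holds order-0 pages
	grep -A 6 pagesets /proc/zoneinfo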
* Re: Mainline kernel OLTP performance update 2009-01-20 5:16 ` Zhang, Yanmin @ 2009-01-21 23:58 ` Christoph Lameter 2009-01-22 8:36 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-21 23:58 UTC (permalink / raw) To: Zhang, Yanmin Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty [-- Attachment #1: Type: TEXT/PLAIN, Size: 1708 bytes --] On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. That order can be changed. Try specifying slub_max_order=0 on the kernel command line to force an order 0 alloc. The queues of the page allocator are of limited use due to their overhead. Order-1 allocations can actually be 5% faster than order-0. order-0 makes sense if pages are pushed rapidly to the page allocator and are then reissues elsewhere. If there is a linear consumption then the page allocator queues are just overhead. > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer. That usually does not matter because of partial list avoiding page allocator actions. > SLQB has no such issue, because: > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up > later on quickly without lock. A batch parameter to control the free object recollection is mostly > 1024. > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can > benefit from zone_pcp(zone, cpu)->pcp page buffer. > > So SLUB need resolve such issues that one process allocates a batch of objects and another process > frees them batchly. SLUB has a percpu freelist but its bounded by the basic allocation unit. You can increase that by modifying the allocation order. Writing a 3 or 5 into the order value in /sys/kernel/slab/xxx/order would do the trick. ^ permalink raw reply [flat|nested] 42+ messages in thread
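(Concretely, the two knobs Christoph mentions look roughly like this; the
cache name below is only an example, substitute whichever cache is being
tuned.)

	# boot parameter: force every SLUB cache down to order-0 slabs
	slub_max_order=0

	# runtime: raise the allocation order of one cache to 3
	echo 3 > /sys/kernel/slab/kmalloc-4096/order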
* Re: Mainline kernel OLTP performance update 2009-01-21 23:58 ` Christoph Lameter @ 2009-01-22 8:36 ` Zhang, Yanmin 2009-01-22 9:15 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-22 8:36 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > That order can be changed. Try specifying slub_max_order=0 on the kernel > command line to force an order 0 alloc. I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. I checked my instrumentation in kernel and found it's caused by large object allocation/free whose size is more than PAGE_SIZE. Here its order is 1. The right free callchain is __kfree_skb => skb_release_all => skb_release_data. So this case isn't the issue that batch of allocation/free might erase partial page functionality. '#slaninfo -AD' couldn't show statistics of large object allocation/free. Can we add such info? That will be more helpful. In addition, I didn't find such issue wih TCP stream testing. > > The queues of the page allocator are of limited use due to their overhead. > Order-1 allocations can actually be 5% faster than order-0. order-0 makes > sense if pages are pushed rapidly to the page allocator and are then > reissues elsewhere. If there is a linear consumption then the page > allocator queues are just overhead. > > > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. > > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer. > > That usually does not matter because of partial list avoiding page > allocator actions. > > > SLQB has no such issue, because: > > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up > > later on quickly without lock. A batch parameter to control the free object recollection is mostly > > 1024. > > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can > > benefit from zone_pcp(zone, cpu)->pcp page buffer. > > > > So SLUB need resolve such issues that one process allocates a batch of objects and another process > > frees them batchly. > > SLUB has a percpu freelist but its bounded by the basic allocation unit. > You can increase that by modifying the allocation order. Writing a 3 or 5 > into the order value in /sys/kernel/slab/xxx/order would do the trick. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 8:36 ` Zhang, Yanmin @ 2009-01-22 9:15 ` Pekka Enberg 2009-01-22 9:28 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-22 9:15 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > command line to force an order 0 alloc. > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > whose size is more than PAGE_SIZE. Here its order is 1. > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > So this case isn't the issue that batch of allocation/free might erase partial page > functionality. So is this the kfree(skb->head) in skb_release_data() or the put_page() calls in the same function in a loop? If it's the former, with big enough size passed to __alloc_skb(), the networking code might be taking a hit from the SLUB page allocator pass-through. Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:15 ` Pekka Enberg @ 2009-01-22 9:28 ` Zhang, Yanmin 2009-01-22 9:47 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-22 9:28 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > command line to force an order 0 alloc. > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > functionality. > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > calls in the same function in a loop? It's kfree(skb->head). > > If it's the former, with big enough size passed to __alloc_skb(), the > networking code might be taking a hit from the SLUB page allocator > pass-through. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:28 ` Zhang, Yanmin @ 2009-01-22 9:47 ` Pekka Enberg 2009-01-23 3:02 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-22 9:47 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote: > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > > command line to force an order 0 alloc. > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > > functionality. > > > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > > calls in the same function in a loop? > It's kfree(skb->head). > > > > > If it's the former, with big enough size passed to __alloc_skb(), the > > networking code might be taking a hit from the SLUB page allocator > > pass-through. Do we know what kind of size is being passed to __alloc_skb() in this case? Maybe we want to do something like this. Pekka SLUB: revert page allocator pass-through This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of page size or higher kmalloc requests"). --- diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 2f5c16b..3bd3662 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -124,7 +124,7 @@ struct kmem_cache { * We keep the general caches in an array of slab caches that are used for * 2^x bytes of allocations. */ -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1]; +extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1]; /* * Sorry that the following has to be that ugly but some versions of GCC @@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size) if (!size) return 0; + if (size > KMALLOC_MAX_SIZE) + return -1; + if (size <= KMALLOC_MIN_SIZE) return KMALLOC_SHIFT_LOW; @@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 1024) return 10; if (size <= 2 * 1024) return 11; if (size <= 4 * 1024) return 12; -/* - * The following is only needed to support architectures with a larger page - * size than 4k. 
- */ if (size <= 8 * 1024) return 13; if (size <= 16 * 1024) return 14; if (size <= 32 * 1024) return 15; @@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 512 * 1024) return 19; if (size <= 1024 * 1024) return 20; if (size <= 2 * 1024 * 1024) return 21; + if (size <= 4 * 1024 * 1024) return 22; + if (size <= 8 * 1024 * 1024) return 23; + if (size <= 16 * 1024 * 1024) return 24; + if (size <= 32 * 1024 * 1024) return 25; return -1; /* @@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) if (index == 0) return NULL; + /* + * This function only gets expanded if __builtin_constant_p(size), so + * testing it here shouldn't be needed. But some versions of gcc need + * help. + */ + if (__builtin_constant_p(size) && index < 0) { + /* + * Generate a link failure. Would be great if we could + * do something to stop the compile here. + */ + extern void __kmalloc_size_too_large(void); + __kmalloc_size_too_large(); + } return &kmalloc_caches[index]; } @@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) void *kmem_cache_alloc(struct kmem_cache *, gfp_t); void *__kmalloc(size_t size, gfp_t flags); -static __always_inline void *kmalloc_large(size_t size, gfp_t flags) -{ - return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size)); -} - static __always_inline void *kmalloc(size_t size, gfp_t flags) { if (__builtin_constant_p(size)) { - if (size > PAGE_SIZE) - return kmalloc_large(size, flags); - if (!(flags & SLUB_DMA)) { struct kmem_cache *s = kmalloc_slab(size); diff --git a/mm/slub.c b/mm/slub.c index 6392ae5..8fad23f 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy); * Kmalloc subsystem *******************************************************************/ -struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned; +struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned; EXPORT_SYMBOL(kmalloc_caches); static int __init setup_slub_min_order(char *str) @@ -2537,7 +2537,7 @@ panic: } #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; +static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1]; static void sysfs_add_func(struct work_struct *w) { @@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags) return ZERO_SIZE_PTR; index = size_index[(size - 1) / 8]; - } else + } else { + if (size > KMALLOC_MAX_SIZE) + return NULL; + index = fls(size - 1); + } #ifdef CONFIG_ZONE_DMA if (unlikely((flags & SLUB_DMA))) @@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, flags); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags) } EXPORT_SYMBOL(__kmalloc); -static void *kmalloc_large_node(size_t size, gfp_t flags, int node) -{ - struct page *page = alloc_pages_node(node, flags | __GFP_COMP, - get_order(size)); - - if (page) - return page_address(page); - else - return NULL; -} - #ifdef CONFIG_NUMA void *__kmalloc_node(size_t size, gfp_t flags, int node) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, flags, node); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2746,11 +2733,8 @@ void kfree(const void *x) return; page = virt_to_head_page(x); - if (unlikely(!PageSlab(page))) { - BUG_ON(!PageCompound(page)); - 
put_page(page); + if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */ return; - } slab_free(page->slab, page, object, _RET_IP_); } EXPORT_SYMBOL(kfree); @@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void) caches++; } - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) { + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { create_kmalloc_cache(&kmalloc_caches[i], "kmalloc", 1 << i, GFP_KERNEL); caches++; @@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void) slab_state = UP; /* Provide the correct kmalloc names now that the caches are up */ - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) kmalloc_caches[i]. name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i); @@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, gfpflags); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, gfpflags, node); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-22  9:47 ` Pekka Enberg
@ 2009-01-23  3:02   ` Zhang, Yanmin
  2009-01-23  6:52     ` Pekka Enberg
  2009-01-23  8:33     ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-23  3:02 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel,
    sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri,
    douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
    srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > >
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > >
> > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > > command line to force an order 0 alloc.
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > >
> > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > >
> > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > >
> > > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > > functionality.
> > >
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> >
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
>
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In __alloc_skb, the original parameter size=4155; SKB_DATA_ALIGN(size)=4224
and sizeof(struct skb_shared_info)=472, so __kmalloc_track_caller's size
parameter is 4696.

> Maybe we want to do something like this.
>
> Pekka
>
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name          Objects     Alloc      Free    %Fast
:0000256         1658  70350463  70348946    99 99
kmalloc-8192       31  70322309  70322293    99 99
:0000168         2592    143154    140684    93 28
:0004096         1456     91072     89644    99 96
:0000192         3402     63838     60491    89 11
:0000064         6177     49635     43743    98 77

So kmalloc-8192 appears. Without the patch, kmalloc-8192 doesn't show up at
all. kmalloc-8192's default order on my 8-core Stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
   than SLQB's;
2) If I start 1 client and 1 server and bind them to different physical CPUs,
   SLQB's result is about 10% better than SLUB's.

I don't know why there is still a 10% difference in case 2). Maybe cache
misses cause it?

> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> --- > > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h > index 2f5c16b..3bd3662 100644 > --- a/include/linux/slub_def.h > +++ b/include/linux/slub_def.h -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 3:02 ` Zhang, Yanmin @ 2009-01-23 6:52 ` Pekka Enberg 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:33 ` Nick Piggin 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 6:52 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo Zhang, Yanmin wrote: >>>> If it's the former, with big enough size passed to __alloc_skb(), the >>>> networking code might be taking a hit from the SLUB page allocator >>>> pass-through. >> Do we know what kind of size is being passed to __alloc_skb() in this >> case? > In function __alloc_skb, original parameter size=4155, > SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so > __kmalloc_track_caller's parameter size=4696. OK, so all allocations go straight to the page allocator. > >> Maybe we want to do something like this. >> >> SLUB: revert page allocator pass-through > This patch amost fixes the netperf UDP-U-4k issue. > > #slabinfo -AD > Name Objects Alloc Free %Fast > :0000256 1658 70350463 70348946 99 99 > kmalloc-8192 31 70322309 70322293 99 99 > :0000168 2592 143154 140684 93 28 > :0004096 1456 91072 89644 99 96 > :0000192 3402 63838 60491 89 11 > :0000064 6177 49635 43743 98 77 > > So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides. > kmalloc-8192's default order on my 8-core stoakley is 2. Christoph, should we merge my patch as-is or do you have an alternative fix in mind? We could, of course, increase kmalloc() caches one level up to 8192 or higher. > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > is about 10% better than SLUB's. > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? Maybe we can use the perfstat and/or kerneltop utilities of the new perf counters patch to diagnose this: http://lkml.org/lkml/2009/1/21/273 And do oprofile, of course. Thanks! Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 6:52 ` Pekka Enberg @ 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:30 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 8:06 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > is about 10% better than SLUB's. > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > counters patch to diagnose this: > > http://lkml.org/lkml/2009/1/21/273 > > And do oprofile, of course. Thanks! I assume binding the client and the server to different physical CPUs also means that the SKB is always allocated on CPU 1 and freed on CPU 2? If so, we will be taking the __slab_free() slow path all the time on kfree() which will cause cache effects, no doubt. But there's another potential performance hit we're taking because the object size of the cache is so big. As allocations from CPU 1 keep coming in, we need to allocate new pages and unfreeze the per-cpu page. That in turn causes __slab_free() to be more eager to discard the slab (see the PageSlubFrozen check there). So before going for cache profiling, I'd really like to see an oprofile report. I suspect we're still going to see much more page allocator activity there than with SLAB or SLQB which is why we're still behaving so badly here. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:06 ` Pekka Enberg @ 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 0 siblings, 2 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-23 8:30 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > is about 10% better than SLUB's. > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > counters patch to diagnose this: > > > > http://lkml.org/lkml/2009/1/21/273 > > > > And do oprofile, of course. Thanks! > > I assume binding the client and the server to different physical CPUs > also means that the SKB is always allocated on CPU 1 and freed on CPU > 2? If so, we will be taking the __slab_free() slow path all the time on > kfree() which will cause cache effects, no doubt. > > But there's another potential performance hit we're taking because the > object size of the cache is so big. As allocations from CPU 1 keep > coming in, we need to allocate new pages and unfreeze the per-cpu page. > That in turn causes __slab_free() to be more eager to discard the slab > (see the PageSlubFrozen check there). > > So before going for cache profiling, I'd really like to see an oprofile > report. I suspect we're still going to see much more page allocator > activity Theoretically, it should, but oprofile doesn't show that. > there than with SLAB or SLQB which is why we're still behaving > so badly here. 
oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name  symbol name
132779   32.9951  vmlinux   copy_user_generic_string
25334     6.2954  vmlinux   schedule
21032     5.2264  vmlinux   tg_shares_up
17175     4.2679  vmlinux   __skb_recv_datagram
9091      2.2591  vmlinux   sock_def_readable
8934      2.2201  vmlinux   mwait_idle
8796      2.1858  vmlinux   try_to_wake_up
6940      1.7246  vmlinux   __slab_free

#slabinfo -AD
Name          Objects    Alloc     Free    %Fast
:0000256         1643  5215544  5214027    94  0
kmalloc-8192       28  5189576  5189560     0  0
:0000168         2631   141466   138976    92 28
:0004096         1452    88697    87269    99 96
:0000192         3402    63050    59732    89 11
:0000064         6265    46611    40721    98 82
:0000128         1895    30429    28654    93 32

oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name  app name  symbol name
114793   28.7163  vmlinux     vmlinux   copy_user_generic_string
27880     6.9744  vmlinux     vmlinux   tg_shares_up
22218     5.5580  vmlinux     vmlinux   schedule
12238     3.0614  vmlinux     vmlinux   mwait_idle
7395      1.8499  vmlinux     vmlinux   task_rq_lock
7348      1.8382  vmlinux     vmlinux   sock_def_readable
7202      1.8016  vmlinux     vmlinux   sched_clock_cpu
6981      1.7464  vmlinux     vmlinux   __skb_recv_datagram
6566      1.6425  vmlinux     vmlinux   udp_queue_rcv_skb

^ permalink raw reply	[flat|nested] 42+ messages in thread
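(For anyone wanting to collect comparable profiles, a rough sketch using the
legacy opcontrol interface follows; the vmlinux path is a placeholder and the
default CPU_CLK_UNHALTED event is assumed, so adjust for the machine at
hand.)

	opcontrol --init
	opcontrol --vmlinux=/path/to/vmlinux --start
	# ... run the netperf UDP-U-4k workload ...
	opcontrol --dump
	opreport --symbols | head -20
	opcontrol --stop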
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin @ 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 1 sibling, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 8:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). > > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. > > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. > > oprofile output with 2.6.29-rc2-slubrevertlarge: > CPU: Core 2, speed 2666.71 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 132779 32.9951 vmlinux copy_user_generic_string > 25334 6.2954 vmlinux schedule > 21032 5.2264 vmlinux tg_shares_up > 17175 4.2679 vmlinux __skb_recv_datagram > 9091 2.2591 vmlinux sock_def_readable > 8934 2.2201 vmlinux mwait_idle > 8796 2.1858 vmlinux try_to_wake_up > 6940 1.7246 vmlinux __slab_free > > #slaninfo -AD > Name Objects Alloc Free %Fast > :0000256 1643 5215544 5214027 94 0 > kmalloc-8192 28 5189576 5189560 0 0 ^^^^^^ This looks bit funny. Hmm. > :0000168 2631 141466 138976 92 28 > :0004096 1452 88697 87269 99 96 > :0000192 3402 63050 59732 89 11 > :0000064 6265 46611 40721 98 82 > :0000128 1895 30429 28654 93 32 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg @ 2009-01-23 9:46 ` Pekka Enberg 2009-01-23 15:22 ` Christoph Lameter 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 9:46 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > > is about 10% better than SLUB's. > > > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > > counters patch to diagnose this: > > > > > > http://lkml.org/lkml/2009/1/21/273 > > > > > > And do oprofile, of course. Thanks! > > > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). > > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. That's bit surprising, actually. FWIW, I've included a patch for empty slab lists. But it's probably not going to help here. > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. > > oprofile output with 2.6.29-rc2-slubrevertlarge: > CPU: Core 2, speed 2666.71 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 132779 32.9951 vmlinux copy_user_generic_string > 25334 6.2954 vmlinux schedule > 21032 5.2264 vmlinux tg_shares_up > 17175 4.2679 vmlinux __skb_recv_datagram > 9091 2.2591 vmlinux sock_def_readable > 8934 2.2201 vmlinux mwait_idle > 8796 2.1858 vmlinux try_to_wake_up > 6940 1.7246 vmlinux __slab_free > > #slaninfo -AD > Name Objects Alloc Free %Fast > :0000256 1643 5215544 5214027 94 0 > kmalloc-8192 28 5189576 5189560 0 0 > :0000168 2631 141466 138976 92 28 > :0004096 1452 88697 87269 99 96 > :0000192 3402 63050 59732 89 11 > :0000064 6265 46611 40721 98 82 > :0000128 1895 30429 28654 93 32 Looking at __slab_free(), unless page->inuse is constantly zero and we discard the slab, it really is just cache effects (10% sounds like a lot, though!). 
AFAICT, the only way to optimize that is with Christoph's unfinished pointer freelists patches or with a remote free list like in SLQB. Pekka diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 3bd3662..41a4c1a 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -48,6 +48,9 @@ struct kmem_cache_node { unsigned long nr_partial; unsigned long min_partial; struct list_head partial; + unsigned long nr_empty; + unsigned long max_empty; + struct list_head empty; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; diff --git a/mm/slub.c b/mm/slub.c index 8fad23f..5a12597 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -134,6 +134,11 @@ */ #define MAX_PARTIAL 10 +/* + * Maximum number of empty slabs. + */ +#define MAX_EMPTY 1 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page) free_slab(s, page); } +static void discard_or_cache_slab(struct kmem_cache *s, struct page *page) +{ + struct kmem_cache_node *n; + int node; + + node = page_to_nid(page); + n = get_node(s, node); + + dec_slabs_node(s, node, page->objects); + + if (likely(n->nr_empty >= n->max_empty)) { + free_slab(s, page); + } else { + n->nr_empty++; + list_add(&page->lru, &n->partial); + } +} + /* * Per slab locking using the pagelock */ @@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page) } /* - * Lock slab and remove from the partial list. + * Lock slab and remove from the partial or empty list. * * Must hold list_lock. */ @@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n, { if (slab_trylock(page)) { list_del(&page->lru); - n->nr_partial--; __SetPageSlubFrozen(page); return 1; } @@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n, /* * Try to allocate a partial slab from a specific node. */ -static struct page *get_partial_node(struct kmem_cache_node *n) +static struct page *get_partial_or_empty_node(struct kmem_cache_node *n) { struct page *page; @@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n) * partial slab and there is none available then get_partials() * will return NULL. */ - if (!n || !n->nr_partial) + if (!n || (!n->nr_partial && !n->nr_empty)) return NULL; spin_lock(&n->list_lock); + list_for_each_entry(page, &n->partial, lru) - if (lock_and_freeze_slab(n, page)) + if (lock_and_freeze_slab(n, page)) { + n->nr_partial--; + goto out; + } + + list_for_each_entry(page, &n->empty, lru) + if (lock_and_freeze_slab(n, page)) { + n->nr_empty--; goto out; + } page = NULL; out: spin_unlock(&n->list_lock); @@ -1297,7 +1328,7 @@ out: /* * Get a page from somewhere. Search in increasing NUMA distances. */ -static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) +static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags) { #ifdef CONFIG_NUMA struct zonelist *zonelist; @@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > n->min_partial) { - page = get_partial_node(n); + page = get_partial_or_empty_node(n); if (page) return page; } @@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) } /* - * Get a partial page, lock it and return it. + * Get a partial or empty page, lock it and return it. 
*/ -static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node) +static struct page * +get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node) { struct page *page; int searchnode = (node == -1) ? numa_node_id() : node; - page = get_partial_node(get_node(s, searchnode)); + page = get_partial_or_empty_node(get_node(s, searchnode)); if (page || (flags & __GFP_THISNODE)) return page; - return get_any_partial(s, flags); + return get_any_partial_or_empty(s, flags); } /* @@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) } else { slab_unlock(page); stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB); - discard_slab(s, page); + discard_or_cache_slab(s, page); } } } @@ -1542,7 +1574,7 @@ another_slab: deactivate_slab(s, c); new_slab: - new = get_partial(s, gfpflags, node); + new = get_partial_or_empty(s, gfpflags, node); if (new) { c->page = new; stat(c, ALLOC_FROM_PARTIAL); @@ -1693,7 +1725,7 @@ slab_empty: } slab_unlock(page); stat(c, FREE_SLAB); - discard_slab(s, page); + discard_or_cache_slab(s, page); return; debug: @@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s, static void init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s) { + spin_lock_init(&n->list_lock); + n->nr_partial = 0; /* @@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s) else if (n->min_partial > MAX_PARTIAL) n->min_partial = MAX_PARTIAL; - spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); + + n->nr_empty = 0; + /* + * XXX: This needs to take object size into account. We don't need + * empty slabs for caches which will have plenty of partial slabs + * available. Only caches that have either full or empty slabs need + * this kind of optimization. + */ + n->max_empty = MAX_EMPTY; + INIT_LIST_HEAD(&n->empty); + #ifdef CONFIG_SLUB_DEBUG atomic_long_set(&n->nr_slabs, 0); atomic_long_set(&n->total_objects, 0); @@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) spin_unlock_irqrestore(&n->list_lock, flags); } +static void free_empty_slabs(struct kmem_cache *s) +{ + int node; + + for_each_node_state(node, N_NORMAL_MEMORY) { + struct kmem_cache_node *n; + struct page *page, *t; + unsigned long flags; + + n = get_node(s, node); + + if (!n->nr_empty) + continue; + + spin_lock_irqsave(&n->list_lock, flags); + + list_for_each_entry_safe(page, t, &n->empty, lru) { + list_del(&page->lru); + n->nr_empty--; + + free_slab(s, page); + } + spin_unlock_irqrestore(&n->list_lock, flags); + } +} + /* * Release all resources used by a slab cache. */ @@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s) flush_all(s); + free_empty_slabs(s); + /* Attempt to free all objects */ free_kmem_cache_cpus(s); for_each_node_state(node, N_NORMAL_MEMORY) { @@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s) return -ENOMEM; flush_all(s); + free_empty_slabs(s); for_each_node_state(node, N_NORMAL_MEMORY) { n = get_node(s, node); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 9:46 ` Pekka Enberg @ 2009-01-23 15:22 ` Christoph Lameter 2009-01-23 15:31 ` Pekka Enberg 2009-01-24 2:55 ` Zhang, Yanmin 0 siblings, 2 replies; 42+ messages in thread From: Christoph Lameter @ 2009-01-23 15:22 UTC (permalink / raw) To: Pekka Enberg Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > Looking at __slab_free(), unless page->inuse is constantly zero and we > discard the slab, it really is just cache effects (10% sounds like a > lot, though!). AFAICT, the only way to optimize that is with Christoph's > unfinished pointer freelists patches or with a remote free list like in > SLQB. No there is another way. Increase the allocator order to 3 for the kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the larger chunks of data gotten from the page allocator. That will allow slub to do fast allocs. ^ permalink raw reply [flat|nested] 42+ messages in thread
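The arithmetic behind the order-3 suggestion is easy to check with a stand-alone snippet. This is only an illustration, not kernel code; it assumes 4 KiB pages and ignores per-slab metadata:

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages, as on the test machines */

int main(void)
{
	unsigned long size = 8192;	/* kmalloc-8192 object size */
	int order;

	for (order = 1; order <= 5; order++) {
		unsigned long slab_bytes = PAGE_SIZE << order;
		printf("order %d: slab %6lu bytes -> %2lu objects per slab\n",
		       order, slab_bytes, slab_bytes / size);
	}
	return 0;
}

At the default order 2 a kmalloc-8192 slab holds only two objects, so every third allocation can mean another trip to the page allocator; at order 3 each slab serves four allocations, and at order 5 (tried later in the thread) sixteen.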
* Re: Mainline kernel OLTP performance update 2009-01-23 15:22 ` Christoph Lameter @ 2009-01-23 15:31 ` Pekka Enberg 2009-01-23 15:55 ` Christoph Lameter 2009-01-24 2:55 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 15:31 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > On Fri, 23 Jan 2009, Pekka Enberg wrote: > > > Looking at __slab_free(), unless page->inuse is constantly zero and we > > discard the slab, it really is just cache effects (10% sounds like a > > lot, though!). AFAICT, the only way to optimize that is with Christoph's > > unfinished pointer freelists patches or with a remote free list like in > > SLQB. > > No there is another way. Increase the allocator order to 3 for the > kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > larger chunks of data gotten from the page allocator. That will allow slub > to do fast allocs. I wonder why that doesn't happen already, actually. The slub_max_order know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously order 3 should be as good fit as order 2 so 'fraction' can't be too high either. Hmm. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
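Pekka's puzzlement can be reproduced outside the kernel. The sketch below is a simplified user-space rendering of the 2.6.29-era order heuristic as I read it, so treat the details as approximate rather than a copy of mm/slub.c (assumptions: 4 KiB pages, slub_min_order=0, slub_max_order=3, no per-object metadata, MAX_OBJS_PER_PAGE clamp omitted):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define SLUB_MAX_ORDER	3		/* PAGE_ALLOC_COSTLY_ORDER default */

static int fls_ul(unsigned long x)	/* highest set bit, 1-based */
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* accept the smallest order whose leftover space is <= slab_size/fraction */
static int slab_order(int size, int min_objects, int max_order, int fraction)
{
	int order = fls_ul((unsigned long)min_objects * size - 1) - PAGE_SHIFT;

	if (order < 0)
		order = 0;
	for (; order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;

		if (slab_size < (unsigned long)min_objects * size)
			continue;
		if (slab_size % size <= slab_size / fraction)
			break;
	}
	return order;			/* > max_order means "no fit" */
}

static int calculate_order(int size, int nr_cpu_ids)
{
	int min_objects = 4 * (fls_ul(nr_cpu_ids) + 1);
	int fraction, order;

	while (min_objects > 1) {
		for (fraction = 16; fraction >= 4; fraction /= 2) {
			order = slab_order(size, min_objects,
					   SLUB_MAX_ORDER, fraction);
			if (order <= SLUB_MAX_ORDER)
				return order;
		}
		min_objects /= 2;	/* the step Yanmin's patch below refines */
	}
	return slab_order(size, 1, SLUB_MAX_ORDER, 1);
}

int main(void)
{
	/* 2 sockets x 4 cores, as in the reported runs */
	printf("kmalloc-8192 -> order %d\n", calculate_order(8192, 8));
	printf("4096-byte cache (sgpool-128 sized) -> order %d\n",
	       calculate_order(4096, 8));
	return 0;
}

Run with 8 CPUs this prints order 2 for the 8192-byte cache and order 3 for the 4096-byte one, matching the slabinfo table Yanmin posts further down: min_objects starts at 4 * (fls(8) + 1) = 20, and halving it steps 20 -> 10 -> 5 -> 2, skipping 4, which is the largest object count that still fits in an order-3 slab.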
* Re: Mainline kernel OLTP performance update 2009-01-23 15:31 ` Pekka Enberg @ 2009-01-23 15:55 ` Christoph Lameter 2009-01-23 16:01 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-23 15:55 UTC (permalink / raw) To: Pekka Enberg Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > I wonder why that doesn't happen already, actually. The slub_max_order > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously > order 3 should be as good fit as order 2 so 'fraction' can't be too high > either. Hmm. The kmalloc-8192 is new. Look at slabinfo output to see what allocation orders are chosen. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 15:55 ` Christoph Lameter @ 2009-01-23 16:01 ` Pekka Enberg 0 siblings, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 16:01 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > > I wonder why that doesn't happen already, actually. The slub_max_order > > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously > > order 3 should be as good fit as order 2 so 'fraction' can't be too high > > either. Hmm. On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote: > The kmalloc-8192 is new. Look at slabinfo output to see what allocation > orders are chosen. Yes, yes, I know the new cache a result of my patch. I'm just saying that AFAICT, the existing logic should set the order to 3 but IIRC Yanmin said it's 2. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 15:22 ` Christoph Lameter 2009-01-23 15:31 ` Pekka Enberg @ 2009-01-24 2:55 ` Zhang, Yanmin 2009-01-24 7:36 ` Pekka Enberg 2009-01-26 17:36 ` Christoph Lameter 1 sibling, 2 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-24 2:55 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > On Fri, 23 Jan 2009, Pekka Enberg wrote: > > > Looking at __slab_free(), unless page->inuse is constantly zero and we > > discard the slab, it really is just cache effects (10% sounds like a > > lot, though!). AFAICT, the only way to optimize that is with Christoph's > > unfinished pointer freelists patches or with a remote free list like in > > SLQB. > > No there is another way. Increase the allocator order to 3 for the > kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > larger chunks of data gotten from the page allocator. That will allow slub > to do fast allocs. After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. But when trying to increased it to 4, I got: [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order -bash: echo: write error: Invalid argument Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning against specific benchmarks. One hard is to tune page order number. Although SLQB also has many tuning options, I almost doesn't tune it manually, just run benchmark and collect results to compare. Does that mean the scalability of SLQB is better? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 2:55 ` Zhang, Yanmin @ 2009-01-24 7:36 ` Pekka Enberg 2009-02-12 5:22 ` Zhang, Yanmin 2009-01-26 17:36 ` Christoph Lameter 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-24 7:36 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: >> No there is another way. Increase the allocator order to 3 for the >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the >> larger chunks of data gotten from the page allocator. That will allow slub >> to do fast allocs. On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. Great. We should fix calculate_order() to be order 3 for kmalloc-8192. Are you interested in doing that? On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > But when trying to increased it to 4, I got: > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > -bash: echo: write error: Invalid argument That's probably because max order is capped to 3. You can change that by passing slub_max_order=<n> as kernel parameter. On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning > against specific benchmarks. One hard is to tune page order number. Although SLQB also > has many tuning options, I almost doesn't tune it manually, just run benchmark and > collect results to compare. Does that mean the scalability of SLQB is better? One thing is sure, SLUB seems to be hard to tune. Probably because it's dependent on the page order so much. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 7:36 ` Pekka Enberg @ 2009-02-12 5:22 ` Zhang, Yanmin 2009-02-12 5:47 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-02-12 5:22 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote: > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > >> No there is another way. Increase the allocator order to 3 for the > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > >> larger chunks of data gotten from the page allocator. That will allow slub > >> to do fast allocs. > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192. > Are you interested in doing that? Pekka, Sorry for the late update. The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order. slab_size order name ------------------------------------------------- 4096 3 sgpool-128 8192 2 kmalloc-8192 16384 3 kmalloc-16384 kmalloc-8192's default order is smaller than sgpool-128's. On 4*4 tigerton machine, a similiar issue appears on another kmem_cache. Function calculate_order uses 'min_objects /= 2;' to shrink. Plus size calculation/checking in slab_order, sometimes above issue appear. Below patch against 2.6.29-rc2 fixes it. I checked the default orders of all kmem_cache and they don't become smaller than before. So the patch wouldn't hurt performance. Signed-off-by Zhang Yanmin <yanmin.zhang@linux.intel.com> --- diff -Nraup linux-2.6.29-rc2/mm/slub.c linux-2.6.29-rc2_slubcalc_order/mm/slub.c --- linux-2.6.29-rc2/mm/slub.c 2009-02-11 00:49:48.000000000 -0500 +++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c 2009-02-12 00:08:24.000000000 -0500 @@ -1856,6 +1856,7 @@ static inline int calculate_order(int si min_objects = slub_min_objects; if (!min_objects) min_objects = 4 * (fls(nr_cpu_ids) + 1); + min_objects = min(min_objects, (PAGE_SIZE << slub_max_order)/size); while (min_objects > 1) { fraction = 16; while (fraction >= 4) { @@ -1865,7 +1866,7 @@ static inline int calculate_order(int si return order; fraction /= 2; } - min_objects /= 2; + min_objects --; } /* ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-02-12  5:22         ` Zhang, Yanmin
@ 2009-02-12  5:47         ` Zhang, Yanmin
  2009-02-12 15:25           ` Christoph Lameter
  2009-02-12 16:03           ` Pekka Enberg
  0 siblings, 2 replies; 42+ messages in thread
From: Zhang, Yanmin @ 2009-02-12  5:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote:
> On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> > >> No there is another way. Increase the allocator order to 3 for the
> > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> > >> larger chunks of data gotten from the page allocator. That will allow slub
> > >> to do fast allocs.
> >
> > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> > <yanmin_zhang@linux.intel.com> wrote:
> > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.
> >
> > Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> > Are you interested in doing that?
> Pekka,
>
> Sorry for the late update.
> The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.

Oh, the previous patch had a compile warning. Please use the patch below instead.

From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

The default order of kmalloc-8192 on the 2*4 stoakley machine is an issue
of calculate_order:

  slab_size   order   name
  -------------------------------------------
       4096     3     sgpool-128
       8192     2     kmalloc-8192
      16384     3     kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's. On the 4*4
tigerton machine, a similar issue appears on another kmem_cache.

Function calculate_order uses 'min_objects /= 2;' to shrink min_objects.
Combined with the size calculation/checking in slab_order, the above issue
sometimes appears. The patch below, against 2.6.29-rc2, fixes it. I checked
the default orders of all kmem_caches and none of them become smaller than
before, so the patch shouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:47:52.000000000 -0500
@@ -1844,6 +1844,7 @@ static inline int calculate_order(int si
 	int order;
 	int min_objects;
 	int fraction;
+	int max_objects;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -1856,6 +1857,9 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	max_objects = (PAGE_SIZE << slub_max_order)/size;
+	min_objects = min(min_objects, max_objects);
+
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1869,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 42+ messages in thread
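For readers following the arithmetic, this is what the two hunks change for kmalloc-8192 on the 2*4 machine, assuming 4 KiB pages and the default slub_max_order of 3:

  min_objects = 4 * (fls(8) + 1) = 20        (2 sockets x 4 cores)

  20 * 8192 = 163840 bytes  -> needs order 6   (> slub_max_order, rejected)
  10 * 8192 =  81920 bytes  -> needs order 5   (rejected)
   5 * 8192 =  40960 bytes  -> needs order 4   (rejected)
   2 * 8192 =  16384 bytes  -> order 2 fits    (old loop stops here)

  with the patch:
  max_objects = (4096 << 3) / 8192 = 4
  min_objects = min(20, 4)          = 4
   4 * 8192 =  32768 bytes  -> order 3 fits    (first attempt succeeds)

Capping min_objects at what an order-slub_max_order slab can actually hold makes the first slab_order() attempt succeed, and stepping min_objects down by one instead of halving it keeps other object sizes from overshooting their best fit in the same way.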
* Re: Mainline kernel OLTP performance update 2009-02-12 5:47 ` Zhang, Yanmin @ 2009-02-12 15:25 ` Christoph Lameter 2009-02-12 16:07 ` Pekka Enberg 2009-02-12 16:03 ` Pekka Enberg 1 sibling, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-02-12 15:25 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar [-- Attachment #1: Type: TEXT/PLAIN, Size: 679 bytes --] On Thu, 12 Feb 2009, Zhang, Yanmin wrote: > The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. > > > slab_size order name > ------------------------------------------------- > 4096 3 sgpool-128 > 8192 2 kmalloc-8192 > 16384 3 kmalloc-16384 > > kmalloc-8192's default order is smaller than sgpool-128's. You reverted the page allocator passthrough patch before this right? Otherwise kmalloc-8192 should not exist and allocation calls for 8192 bytes would be converted inline to request of an order 1 page from the page allocator. ^ permalink raw reply [flat|nested] 42+ messages in thread
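For anyone who lost the thread of that remark: with SLUB's large-kmalloc passthrough in place there is no kmalloc-8192 kmem_cache at all, and requests bigger than a page go straight to the page allocator. A toy model of just that dispatch decision (the one-page threshold is my reading of the 2.6.29 code, so treat it as an assumption):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* smallest order such that (PAGE_SIZE << order) >= size */
static int passthrough_order(size_t size)
{
	int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

int main(void)
{
	printf("kmalloc(8192) -> page allocator, order %d\n",
	       passthrough_order(8192));
	return 0;
}

An 8192-byte request maps to an order-1 page allocation, which is the behaviour Christoph describes; the kmalloc-8192 cache being tuned in this sub-thread exists only because the tree under test changes that behaviour, which is what he is checking here.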
* Re: Mainline kernel OLTP performance update 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:07 ` Pekka Enberg 0 siblings, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-02-12 16:07 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar Hi Christoph, On Thu, 12 Feb 2009, Zhang, Yanmin wrote: >> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. >> >> >> slab_size order name >> ------------------------------------------------- >> 4096 3 sgpool-128 >> 8192 2 kmalloc-8192 >> 16384 3 kmalloc-16384 >> >> kmalloc-8192's default order is smaller than sgpool-128's. On Thu, Feb 12, 2009 at 5:25 PM, Christoph Lameter <cl@linux-foundation.org> wrote: > You reverted the page allocator passthrough patch before this right? > Otherwise kmalloc-8192 should not exist and allocation calls for 8192 > bytes would be converted inline to request of an order 1 page from the > page allocator. Yup, I assume that's the case here. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-02-12 5:47 ` Zhang, Yanmin 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:03 ` Pekka Enberg 1 sibling, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-02-12 16:03 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote: > > > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > > > >> No there is another way. Increase the allocator order to 3 for the > > > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > > > >> larger chunks of data gotten from the page allocator. That will allow slub > > > >> to do fast allocs. > > > > > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin > > > <yanmin_zhang@linux.intel.com> wrote: > > > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > > > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. > > > > > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192. > > > Are you interested in doing that? On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote: > > Pekka, > > > > Sorry for the late update. > > The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order. On Thu, 2009-02-12 at 13:47 +0800, Zhang, Yanmin wrote: > Oh, previous patch has a compiling warning. Pls. use below patch. > > From: Zhang Yanmin <yanmin.zhang@linux.intel.com> > > The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. Applied to the 'topic/slub/perf' branch. Thanks! Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 2:55 ` Zhang, Yanmin 2009-01-24 7:36 ` Pekka Enberg @ 2009-01-26 17:36 ` Christoph Lameter 2009-02-01 2:52 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-26 17:36 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > But when trying to increased it to 4, I got: > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > -bash: echo: write error: Invalid argument This is because 4 is more than the maximum allowed order. You can reconfigure that by setting slub_max_order=5 or so on boot. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-26 17:36 ` Christoph Lameter @ 2009-02-01 2:52 ` Zhang, Yanmin 0 siblings, 0 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-02-01 2:52 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Mon, 2009-01-26 at 12:36 -0500, Christoph Lameter wrote: > On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > > > But when trying to increased it to 4, I got: > > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > > -bash: echo: write error: Invalid argument > > This is because 4 is more than the maximum allowed order. You can > reconfigure that by setting > > slub_max_order=5 > > or so on boot. With slub_max_order=5, the default order of kmalloc-8192 becomes 5. I tested it with netperf UDP-U-4k and the result difference from SLAB/SLQB is less than 1% which is really fluctuation. ^ permalink raw reply [flat|nested] 42+ messages in thread
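To put the order-5 result in perspective (4 KiB pages assumed):

  slub_max_order=5:  slab size        = 4096 << 5 = 131072 bytes (32 pages)
                     objects per slab = 131072 / 8192 = 16

Only one allocation in sixteen then has to go back to the page allocator, at the price of requiring order-5 contiguous pages; this is exactly the kind of per-cache, per-machine tuning that Yanmin's earlier question about SLUB's tunability was getting at.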
* Re: Mainline kernel OLTP performance update 2009-01-23 3:02 ` Zhang, Yanmin 2009-01-23 6:52 ` Pekka Enberg @ 2009-01-23 8:33 ` Nick Piggin 2009-01-23 9:02 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-23 8:33 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote: > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better > than SLQB's; I'll have to look into this too. Could be evidence of the possible TLB improvement from using bigger pages and/or page-specific freelist, I suppose. Do you have a scripted used to start netperf in that configuration? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:33 ` Nick Piggin @ 2009-01-23 9:02 ` Zhang, Yanmin 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-23 9:02 UTC (permalink / raw) To: Nick Piggin Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty [-- Attachment #1: Type: text/plain, Size: 622 bytes --] On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote: > On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better > > than SLQB's; > > I'll have to look into this too. Could be evidence of the possible > TLB improvement from using bigger pages and/or page-specific freelist, > I suppose. > > Do you have a scripted used to start netperf in that configuration? See the attachment. Steps to run testing: 1) compile netperf; 2) Change PROG_DIR to path/to/netperf/src; 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. [-- Attachment #2: start_netperf_udp_v4.sh --] [-- Type: application/x-shellscript, Size: 1361 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 9:02 ` Zhang, Yanmin @ 2009-01-23 18:40 ` Rick Jones 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 0 siblings, 2 replies; 42+ messages in thread From: Rick Jones @ 2009-01-23 18:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. Some comments on the script: > #!/bin/sh > > PROG_DIR=/home/ymzhang/test/netperf/src > date=`date +%H%M%N` > #PROG_DIR=/root/netperf/netperf/src > client_num=$1 > pin_cpu=$2 > > start_port_server=12384 > start_port_client=15888 > > killall netserver > ${PROG_DIR}/netserver > sleep 2 Any particular reason for killing-off the netserver daemon? > if [ ! -d result ]; then > mkdir result > fi > > all_result_files="" > for i in `seq 1 ${client_num}`; do > if [ "${pin_cpu}" == "pin" ]; then > pin_param="-T ${i} ${i}" The -T option takes arguments of the form: N - bind both netperf and netserver to core N N, - bind only netperf to core N, float netserver ,M - float netperf, bind only netserver to core M N,M - bind netperf to core N and netserver to core M Without a comma between N and M knuth only knows what the command line parser will do :) > fi > result_file=result/netperf_${start_port_client}.${date} > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096 > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096 > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} & > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} & Same thing here for the -P option - there needs to be a comma between the two port numbers otherwise, the best case is that the second port number is ignored. Worst case is that netperf starts doing knuth only knows what. To get quick profiles, that form of aggregate netperf is OK - just the one iteration with background processes using a moderatly long run time. However, for result reporting, it is best to (ab)use the confidence intervals functionality to try to avoid skew errors. I tend to add-in a global -i 30 option to get each netperf to repeat its measurments 30 times. That way one is reasonably confident that skew issues are minimized. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance And I would probably add the -c and -C options to have netperf report service demands. 
> sub_pid="${sub_pid} `echo $!`" > port_num=$((${port_num}+1)) > all_result_files="${all_result_files} ${result_file}" > start_port_server=$((${start_port_server}+1)) > start_port_client=$((${start_port_client}+1)) > done; > > wait ${sub_pid} > killall netserver > > result="0" > for i in `echo ${all_result_files}`; do > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > result=`echo "${result}+${sub_result}"|bc` > done; The documented-only-in-source :( "omni" tests in top-of-trunk netperf: http://www.netperf.org/svn/netperf2/trunk ./configure --enable-omni allow one to specify which result values one wants, in which order, either as more or less traditional netperf output (test-specific -O), CSV (test-specific -o) or keyval (test-specific -k). All three take an optional filename as an argument with the file containing a list of desired output values. You can give a "filename" of '?' to get the list of output values known to that version of netperf. Might help simplify parsing and whatnot. happy benchmarking, rick jones > > echo $result > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones @ 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 1 sibling, 0 replies; 42+ messages in thread From: Grant Grundler @ 2009-01-23 18:51 UTC (permalink / raw) To: Rick Jones Cc: Zhang, Yanmin, Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote: ... > And I would probably add the -c and -C options to have netperf report > service demands. For performance analysis, the service demand is often more interesting than the absolute performance (which typically only varies a few Mb/s for gigE NICs). I strongly encourage adding -c and -C. grant ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 2009-01-23 18:51 ` Grant Grundler @ 2009-01-24 3:03 ` Zhang, Yanmin 2009-01-26 18:26 ` Rick Jones 1 sibling, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-24 3:03 UTC (permalink / raw) To: Rick Jones Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote: > > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. > > Some comments on the script: Thanks. I wanted to run the testing to get result quickly as long as the result has no big fluctuation. > > > #!/bin/sh > > > > PROG_DIR=/home/ymzhang/test/netperf/src > > date=`date +%H%M%N` > > #PROG_DIR=/root/netperf/netperf/src > > client_num=$1 > > pin_cpu=$2 > > > > start_port_server=12384 > > start_port_client=15888 > > > > killall netserver > > ${PROG_DIR}/netserver > > sleep 2 > > Any particular reason for killing-off the netserver daemon? I'm not sure if prior running might leave any impact on later running, so just kill netserver. > > > if [ ! -d result ]; then > > mkdir result > > fi > > > > all_result_files="" > > for i in `seq 1 ${client_num}`; do > > if [ "${pin_cpu}" == "pin" ]; then > > pin_param="-T ${i} ${i}" > > The -T option takes arguments of the form: > > N - bind both netperf and netserver to core N > N, - bind only netperf to core N, float netserver > ,M - float netperf, bind only netserver to core M > N,M - bind netperf to core N and netserver to core M > > Without a comma between N and M knuth only knows what the command line parser > will do :) > > > fi > > result_file=result/netperf_${start_port_client}.${date} > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096 > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096 > > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} & > > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} & > > Same thing here for the -P option - there needs to be a comma between the two > port numbers otherwise, the best case is that the second port number is ignored. > Worst case is that netperf starts doing knuth only knows what. Thanks. > > > To get quick profiles, that form of aggregate netperf is OK - just the one > iteration with background processes using a moderatly long run time. However, > for result reporting, it is best to (ab)use the confidence intervals > functionality to try to avoid skew errors. Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need finer-tuning or investigation, I would turn on more options. > I tend to add-in a global -i 30 > option to get each netperf to repeat its measurments 30 times. That way one is > reasonably confident that skew issues are minimized. > > http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance > > And I would probably add the -c and -C options to have netperf report service > demands. 
Yes. That's good. I'm used to start vmstat or mpstat to monitor cpu utilization in real time. > > > > sub_pid="${sub_pid} `echo $!`" > > port_num=$((${port_num}+1)) > > all_result_files="${all_result_files} ${result_file}" > > start_port_server=$((${start_port_server}+1)) > > start_port_client=$((${start_port_client}+1)) > > done; > > > > wait ${sub_pid} > > killall netserver > > > > result="0" > > for i in `echo ${all_result_files}`; do > > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > > result=`echo "${result}+${sub_result}"|bc` > > done; > > The documented-only-in-source :( "omni" tests in top-of-trunk netperf: > > http://www.netperf.org/svn/netperf2/trunk > > ./configure --enable-omni > > allow one to specify which result values one wants, in which order, either as > more or less traditional netperf output (test-specific -O), CSV (test-specific > -o) or keyval (test-specific -k). All three take an optional filename as an > argument with the file containing a list of desired output values. You can give > a "filename" of '?' to get the list of output values known to that version of > netperf. > > Might help simplify parsing and whatnot. Yes, it does. > > happy benchmarking, > > rick jones Thanks again. I learned a lot. > > > > > echo $result > > > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-24 3:03 ` Zhang, Yanmin @ 2009-01-26 18:26 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-26 18:26 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>To get quick profiles, that form of aggregate netperf is OK - just the one >>iteration with background processes using a moderatly long run time. However, >>for result reporting, it is best to (ab)use the confidence intervals >>functionality to try to avoid skew errors. > > Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need > finer-tuning or investigation, I would turn on more options. Netperf will silently clip that to 30 as that is all the built-in tables know. > Thanks again. I learned a lot. Feel free to wander over to netperf-talk over at netperf.org if you want to talk some more about the care and feeding of netperf. happy benchmarking, rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:00 ` Andrew Morton 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 42+ messages in thread From: Andrew Morton @ 2009-01-16 7:00 UTC (permalink / raw) To: Nick Piggin Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: > > > I would like to see SLQB merged in mainline, made default, and wait for > > > some number releases. Then we take what we know, and try to make an > > > informed decision about the best one to take. I guess that is problematic > > > in that the rest of the kernel is moving underneath us. Do you have > > > another idea? > > > > Nope. If it doesn't work out, we can remove it again I guess. > > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). That's a plan. > SLQB tends to be the winner here. Can you think of anything with which it will be the loser? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 1 sibling, 0 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-16 7:25 UTC (permalink / raw) To: Andrew Morton Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 16 January 2009 18:00:43 Andrew Morton wrote: > On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin > > > <nickpiggin@yahoo.com.au> > > > > wrote: > > > > I would like to see SLQB merged in mainline, made default, and wait > > > > for some number releases. Then we take what we know, and try to make > > > > an informed decision about the best one to take. I guess that is > > > > problematic in that the rest of the kernel is moving underneath us. > > > > Do you have another idea? > > > > > > Nope. If it doesn't work out, we can remove it again I guess. > > > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > That's a plan. > > > SLQB tends to be the winner here. > > Can you think of anything with which it will be the loser? Well, that fio test showed it was behind SLAB. I just discovered that yesterday during running these tests, so I'll take a look at that. The Intel performance guys I think have one or two cases where it is slower. They don't seem to be too serious, and tend to be specific to some machines (eg. the same test with a different CPU architecture turns out to be faster). So I'll be looking into these things, but I haven't seen anything too serious yet. I'm mostly interested in macro benchmarks and more real world workloads. At a higher level, SLAB has some interesting features. It basically has "crossbars" of queues, that basically provide queues for allocating and freeing to and from different CPUs and nodes. This is what bloats up the kmem_cache data structures to tens or hundreds of gigabytes each on SGI size systems. But it is also has good properties. On smaller multiprocessor and NUMA systems, it might be the case that SLAB does better in workloads that involve objects being allocated on one CPU and freed on another. I haven't actually observed problems here, but I don't have a lot of good tests. SLAB is also fundamentally different from SLUB and SLQB in that it uses arrays to store pointers to objects in its queues, rather than having a linked list using pointers embedded in the objects. This might in some cases make it easier to prefetch objects in parallel with finding the object itself. I haven't actually been able to attribute a particular regression to this interesting difference, but it might turn up as an issue. These are two big differences between SLAB and SLQB. 
The linked lists of objects were chosen over arrays again because of the
memory overhead, because they make it easier to tune the size of the queues,
because they avoid the overhead of copying arrays of pointers around (SLQB
can just splice the head of one list onto the tail of another in order to
move objects around), and because they eliminate the need for any metadata
beyond the struct page for each slab.

The crossbars of queues were removed because of the bloat and memory
overhead issues. The fact that we now have linked lists helps a little bit
with this, because moving lists of objects around gets easier.

^ permalink raw reply	[flat|nested] 42+ messages in thread
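A deliberately stripped-down sketch of the structural difference being described; the names and bounds here are made up for illustration and are not the real kernel structures:

/* Illustrative only -- not the kernel's definitions. */

#define NR_CPUS		64	/* made-up bounds for the sketch */
#define MAX_NUMNODES	8

/* SLAB-style: arrays of object pointers, one queue per CPU plus "alien"
 * queues for remote nodes -- the crossbar whose size grows with both CPU
 * and node count, and which moves objects by copying pointer arrays. */
struct array_queue {
	unsigned int avail;
	unsigned int limit;
	void *objects[];		/* pointers copied in and out */
};

struct slab_style_cache {
	struct array_queue *cpu_queue[NR_CPUS];
	struct array_queue *alien[NR_CPUS][MAX_NUMNODES];
};

/* SLQB/SLUB-style: the free objects themselves form a linked list, using
 * the first word of each free object as the "next" pointer, so moving a
 * whole queue is just splicing list heads. */
struct object_list {
	void *head;			/* first free object */
	unsigned long nr;
};

struct slqb_style_cpu {
	struct object_list freelist;	/* local allocs and frees */
	struct object_list remote_free;	/* objects freed by other CPUs */
};

The first layout copies pointers between per-CPU and per-node arrays and grows with CPUs times nodes; the second only ever splices list heads, at the cost of touching the first word of each free object.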
* Re: Mainline kernel OLTP performance update
  2009-01-16  7:00     ` Mainline kernel OLTP performance update Andrew Morton
  2009-01-16  7:25       ` Nick Piggin
@ 2009-01-16  8:59       ` Nick Piggin
  1 sibling, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-16  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers, from the "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't consider its results
very useful on their own, except that it does show up some problems in SLAB's
scalability that may start to bite as we continue to get more threads per
socket. (I ran a few of these tests on one of Dave's 2-socket, 128-thread
systems, and SLAB gets really painful... these kinds of thread counts may
only be a couple of years away for x86.)

All numbers are in CPU cycles.

Single thread testing
=====================

1. Kmalloc: Repeatedly allocate 10000 objs then free them

obj size       SLAB          SLQB          SLUB
      8       77+ 128       69+  47       61+  77
     16       69+ 104      116+  70       77+  80
     32       66+ 101       82+  81       71+  89
     64       82+ 116       95+  81       94+ 105
    128      100+ 148      106+  94      114+ 163
    256      153+ 136      134+  98      124+ 186
    512      209+ 161      170+ 186      134+ 276
   1024      331+ 249      236+ 245      134+ 283
   2048      608+ 443      380+ 386      172+ 312
   4096     1109+ 624      678+ 661      239+ 372
   8192     1166+1077      767+ 683      535+ 433
  16384     1213+1160      914+ 731      577+ 682

We can see SLAB has a fair bit more overhead in this case. I think SLUB
starts doing higher-order allocations around size 256, which reduces costs
there. I don't know what causes the SLQB artifact at 16...

2. Kmalloc: alloc/free test (repeatedly allocate and free)

              SLAB     SLQB     SLUB
      8         98       90       94
     16         98       90       93
     32         98       90       93
     64         99       90       94
    128        100       92       93
    256        104       93       95
    512        105       94       97
   1024        106       93       97
   2048        107       95       95
   4096        111       92       97
   8192        111       94      631
  16384        114       92      741

Here we see SLUB's allocator passthrough (or is that the lack of queueing?).
Straight-line speed at small sizes is probably down to the instruction counts
in the fastpaths. It's pretty meaningless though, because it probably changes
with any actual load on the CPU, or on another CPU architecture. Doesn't look
bad for SLQB though :)

Concurrent allocs
=================

1. Like the first single thread test, lots of allocs, then lots of frees.
   But running on all CPUs. Average over all CPUs.

              SLAB           SLQB          SLUB
      8      251+ 322       73+  47       65+  76
     16      240+ 331       84+  53       67+  82
     32      235+ 316       94+  57       77+  92
     64      338+ 303      120+  66      105+ 136
    128      549+ 355      139+ 166      127+ 344
    256     1129+ 456      189+ 178      236+ 404
    512     2085+ 872      240+ 217      244+ 419
   1024     3895+1373      347+ 333      251+ 440
   2048     7725+2579      616+ 695      373+ 588
   4096    15320+4534     1245+1442      689+1002

A problem with SLAB scalability starts showing up on this system with only
4 threads per socket. Again, SLUB sees a benefit from higher-order
allocations.

2. Same as 2nd single threaded test, alloc then free, on all CPUs.

              SLAB     SLQB     SLUB
      8         99       90       93
     16         99       90       93
     32         99       90       93
     64        100       91       94
    128        102       90       93
    256        105       94       97
    512        106       93       97
   1024        108       93       97
   2048        109       93       96
   4096        110       93       96

No surprises. Objects always fit in queues (or unqueues, in the case of
SLUB), so there is no cross-cache traffic.

Remote free test
================

1. Allocate N objects on CPUs 1-7, then free them all from CPU 0.
   Average cost of all kmalloc+kfree.

              SLAB          SLQB         SLUB
      8      191+ 142       53+ 64       56+  99
     16      180+ 141       82+ 69       60+ 117
     32      173+ 142      100+ 71       78+ 151
     64      240+ 147      131+ 73      117+ 216
    128      441+ 162      158+114      114+ 251
    256      833+ 181      179+119      185+ 263
    512     1546+ 243      220+132      194+ 292
   1024     2886+ 341      299+135      201+ 312
   2048     5737+ 577      517+139      291+ 370
   4096    11288+1201      976+153      528+ 482

2. Objects are allocated on CPU N (on all CPUs), then freed by
   CPU N+1 % NR_CPUS (ie. CPU1 frees the objects allocated by CPU0).

              SLAB          SLQB         SLUB
      8      236+ 331       72+123       64+ 114
     16      232+ 345       80+125       71+ 139
     32      227+ 342       85+134       82+ 183
     64      324+ 336      140+138      111+ 219
    128      569+ 384      245+201      145+ 337
    256     1111+ 448      243+222      238+ 447
    512     2091+ 871      249+244      247+ 470
   1024     3923+1593      254+256      254+ 503
   2048     7700+2968      273+277      369+ 699
   4096    15154+5061      310+323      693+1220

SLAB's concurrent allocation bottlenecks show up again in these tests.
Unfortunately these are not very realistic tests of the remote freeing
pattern, because normally you would expect remote freeing and allocation to
happen concurrently, rather than all allocations up front and then all frees.
If the test behaved like that, then objects could probably fit in SLAB's
queues and it might see some good numbers.

^ permalink raw reply	[flat|nested] 42+ messages in thread
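The two figures in each cell read most naturally as per-object allocation cost plus free cost, going by the test descriptions (that is an interpretation of the slub_test output, not something stated in the post). For readers unfamiliar with the module, the remote-free cases have roughly this shape; the sketch below uses malloc/free stand-ins and elides the CPU pinning, so it only shows the allocation/free pattern, not the NUMA behaviour itself:

#include <stdlib.h>

#define NR_OBJS 10000

int main(void)
{
	static void *objs[NR_OBJS];
	int i;

	/* Test 1 shape: objects allocated on CPUs 1-7 (here: one thread)... */
	for (i = 0; i < NR_OBJS; i++)
		objs[i] = malloc(8192);

	/* ...then all freed from CPU 0, so every free hands an object back
	 * to a queue it did not come from. Test 2 is the same idea with
	 * CPU N+1 freeing what CPU N allocated. In the real module these
	 * are kmem_cache_alloc()/kmem_cache_free() calls issued from kernel
	 * threads bound to the CPUs named in the results. */
	for (i = 0; i < NR_OBJS; i++)
		free(objs[i]);

	return 0;
}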
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 18:11 ` Rick Jones 2009-01-19 7:43 ` Nick Piggin 2 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-16 18:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty Nick Piggin wrote: > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). > > > System is a 2socket, 4 core AMD. Not exactly a large system :) Barely NUMA even with just two sockets. > All debug and stats options turned off for > all the allocators; default parameters (ie. SLUB using higher order pages, > and the others tend to be using order-0). SLQB is the version I recently > posted, with some of the prefetching removed according to Pekka's review > (probably a good idea to only add things like that in if/when they prove to > be an improvement). > > ... > > Netperf UDP unidirectional send test (10 runs, higher better): > > Server and client bound to same CPU > SLAB AVG=60.111 STD=1.59382 > SLQB AVG=60.167 STD=0.685347 > SLUB AVG=58.277 STD=0.788328 > > Server and client bound to same socket, different CPUs > SLAB AVG=85.938 STD=0.875794 > SLQB AVG=93.662 STD=2.07434 > SLUB AVG=81.983 STD=0.864362 > > Server and client bound to different sockets > SLAB AVG=78.801 STD=1.44118 > SLQB AVG=78.269 STD=1.10457 > SLUB AVG=71.334 STD=1.16809 > ... > I haven't done any non-local network tests. Networking is the one of the > subsystems most heavily dependent on slab performance, so if anybody > cares to run their favourite tests, that would be really helpful. I'm guessing, but then are these Mbit/s figures? Would that be the sending throughput or the receiving throughput? I love to see netperf used, but why UDP and loopback? Also, how about the service demands? rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 18:11 ` Rick Jones @ 2009-01-19 7:43 ` Nick Piggin 2009-01-19 22:19 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-19 7:43 UTC (permalink / raw) To: Rick Jones Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Saturday 17 January 2009 05:11:02 Rick Jones wrote: > Nick Piggin wrote: > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > > > > > System is a 2socket, 4 core AMD. > > Not exactly a large system :) Barely NUMA even with just two sockets. You're right ;) But at least it is exercising the NUMA paths in the allocator, and represents a pretty common size of system... I can run some tests on bigger systems at SUSE, but it is not always easy to set up "real" meaningful workloads on them or configure significant IO for them. > > Netperf UDP unidirectional send test (10 runs, higher better): > > > > Server and client bound to same CPU > > SLAB AVG=60.111 STD=1.59382 > > SLQB AVG=60.167 STD=0.685347 > > SLUB AVG=58.277 STD=0.788328 > > > > Server and client bound to same socket, different CPUs > > SLAB AVG=85.938 STD=0.875794 > > SLQB AVG=93.662 STD=2.07434 > > SLUB AVG=81.983 STD=0.864362 > > > > Server and client bound to different sockets > > SLAB AVG=78.801 STD=1.44118 > > SLQB AVG=78.269 STD=1.10457 > > SLUB AVG=71.334 STD=1.16809 > > > > ... > > > > I haven't done any non-local network tests. Networking is the one of the > > subsystems most heavily dependent on slab performance, so if anybody > > cares to run their favourite tests, that would be really helpful. > > I'm guessing, but then are these Mbit/s figures? Would that be the sending > throughput or the receiving throughput? Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair of numbers seemed to be identical IIRC? > I love to see netperf used, but why UDP and loopback? No really good reason. I guess I was hoping to keep other variables as small as possible. But I guess a real remote test would be a lot more realistic as a networking test. Hmm, but I could probably set up a test over a simple GbE link here. I'll try that. > Also, how about the > service demands? Well, over loopback and using CPU binding, I was hoping it wouldn't change much... but I see netperf does some measurements for you. I will consider those in future too. BTW. is it possible to do parallel netperf tests? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 7:43 ` Nick Piggin @ 2009-01-19 22:19 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-19 22:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>>System is a 2socket, 4 core AMD. >> >>Not exactly a large system :) Barely NUMA even with just two sockets. > > > You're right ;) > > But at least it is exercising the NUMA paths in the allocator, and > represents a pretty common size of system... > > I can run some tests on bigger systems at SUSE, but it is not always > easy to set up "real" meaningful workloads on them or configure > significant IO for them. Not sure if I know enough git to pull your trees, or if this cobbler's child will have much in the way of bigger systems, but there is a chance I might - contact me offline with some pointers on how to pull and build the bits and such. >>>Netperf UDP unidirectional send test (10 runs, higher better): >>> >>>Server and client bound to same CPU >>>SLAB AVG=60.111 STD=1.59382 >>>SLQB AVG=60.167 STD=0.685347 >>>SLUB AVG=58.277 STD=0.788328 >>> >>>Server and client bound to same socket, different CPUs >>>SLAB AVG=85.938 STD=0.875794 >>>SLQB AVG=93.662 STD=2.07434 >>>SLUB AVG=81.983 STD=0.864362 >>> >>>Server and client bound to different sockets >>>SLAB AVG=78.801 STD=1.44118 >>>SLQB AVG=78.269 STD=1.10457 >>>SLUB AVG=71.334 STD=1.16809 >>> >> >> > ... >> >>>I haven't done any non-local network tests. Networking is the one of the >>>subsystems most heavily dependent on slab performance, so if anybody >>>cares to run their favourite tests, that would be really helpful. >> >>I'm guessing, but then are these Mbit/s figures? Would that be the sending >>throughput or the receiving throughput? > > > Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair > of numbers seemed to be identical IIRC? Mega *bits* per second? And those were 4K sends right? That seems rather low for loopback - I would have expected nearly two orders of magnitude more. I wonder if the intra-stack flow control kicked-in? You might try adding test specific -S and -s options to set much larger socket buffers to try to avoid that. Or simply use TCP. netperf -H <foo> ... -- -s 1M -S 1M -m 4K >>I love to see netperf used, but why UDP and loopback? > > > No really good reason. I guess I was hoping to keep other variables as > small as possible. But I guess a real remote test would be a lot more > realistic as a networking test. Hmm, but I could probably set up a test > over a simple GbE link here. I'll try that. If bandwidth is an issue, that is to say one saturates the link before much of anything "interesting" happens in the host you can use something like aggregate TCP_RR - ./configure with --enable_burst and then something like netperf -H <remote> -t TCP_RR -- -D -b 32 and it will have as many as 33 discrete transactions in flight at one time on the one connection. The -D is there to set TCP_NODELAY to preclude TCP chunking the single-byte (default, take your pick of a more reasonable size) transactions into one segment. >>Also, how about the service demands? > > > Well, over loopback and using CPU binding, I was hoping it wouldn't > change much... Hope... 
but verify :) > but I see netperf does some measurements for you. I > will consider those in future too. > > BTW. is it possible to do parallel netperf tests? Yes, by (ab)using the confidence intervals code. Poke around in http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section, and I can go into further details offline (or here if folks want to see the discussion). rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
Thread overview: 42+ messages
[not found] <BC02C49EEB98354DBA7F5DD76F2A9E800317003CB0@azsmsx501.amr.corp.intel.com>
[not found] ` <200901161503.13730.nickpiggin@yahoo.com.au>
[not found] ` <20090115201210.ca1a9542.akpm@linux-foundation.org>
2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin
2009-01-16 6:55 ` Matthew Wilcox
2009-01-16 7:06 ` Nick Piggin
2009-01-16 7:53 ` Zhang, Yanmin
2009-01-16 10:20 ` Andi Kleen
2009-01-20 5:16 ` Zhang, Yanmin
2009-01-21 23:58 ` Christoph Lameter
2009-01-22 8:36 ` Zhang, Yanmin
2009-01-22 9:15 ` Pekka Enberg
2009-01-22 9:28 ` Zhang, Yanmin
2009-01-22 9:47 ` Pekka Enberg
2009-01-23 3:02 ` Zhang, Yanmin
2009-01-23 6:52 ` Pekka Enberg
2009-01-23 8:06 ` Pekka Enberg
2009-01-23 8:30 ` Zhang, Yanmin
2009-01-23 8:40 ` Pekka Enberg
2009-01-23 9:46 ` Pekka Enberg
2009-01-23 15:22 ` Christoph Lameter
2009-01-23 15:31 ` Pekka Enberg
2009-01-23 15:55 ` Christoph Lameter
2009-01-23 16:01 ` Pekka Enberg
2009-01-24 2:55 ` Zhang, Yanmin
2009-01-24 7:36 ` Pekka Enberg
2009-02-12 5:22 ` Zhang, Yanmin
2009-02-12 5:47 ` Zhang, Yanmin
2009-02-12 15:25 ` Christoph Lameter
2009-02-12 16:07 ` Pekka Enberg
2009-02-12 16:03 ` Pekka Enberg
2009-01-26 17:36 ` Christoph Lameter
2009-02-01 2:52 ` Zhang, Yanmin
2009-01-23 8:33 ` Nick Piggin
2009-01-23 9:02 ` Zhang, Yanmin
2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
2009-01-23 18:51 ` Grant Grundler
2009-01-24 3:03 ` Zhang, Yanmin
2009-01-26 18:26 ` Rick Jones
2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton
2009-01-16 7:25 ` Nick Piggin
2009-01-16 8:59 ` Nick Piggin
2009-01-16 18:11 ` Rick Jones
2009-01-19 7:43 ` Nick Piggin
2009-01-19 22:19 ` Rick Jones